DOMAIN: Telecom
CONTEXT: A telecom company wants to use its historical customer data to predict customer behaviour and retain customers. All relevant customer data can be analysed to develop focused customer retention programs.
DATA DESCRIPTION: Each row represents a customer and each column contains a customer attribute, as described in the column metadata. The data set includes information about:
Customers who left within the last month – the column is called Churn
Services that each customer has signed up for – phone, multiple lines, internet, online security, online backup, device protection, tech support, and streaming TV and movies
Customer account information – how long they’ve been a customer, contract, payment method, paperless billing, monthly charges, and total charges
Demographic info about customers – gender, age range, and if they have partners and dependents
PROJECT OBJECTIVE: Build a model that identifies the customers with a higher probability of churning. This helps the company understand the pain points and patterns behind customer churn and sharpens the focus of its customer retention strategy.
Steps to the project:
Here we import, in a single cell, all the libraries and modules needed for the whole project.
# Libraries for Basic Process
import numpy as np
import pandas as pd
# Libraries for Visualization
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
# Pre-setting Plot Style
font={'size':15}
plt.rc('font', **font)
plt.rc('xtick',labelsize=12)
plt.rc('ytick',labelsize=12)
sns.set_style({'xtick.bottom':True,'ytick.left':True,'text.color':'#9400D3',
'axes.labelcolor': 'blue','patch.edgecolor': 'black'})
# sklearn Modules
from sklearn.preprocessing import MinMaxScaler
from sklearn.model_selection import train_test_split, cross_val_score, KFold
from sklearn import metrics
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier, plot_tree
from sklearn.ensemble import RandomForestClassifier, BaggingClassifier, AdaBoostClassifier, GradientBoostingClassifier
from sklearn.ensemble import ExtraTreesClassifier, VotingClassifier
# Supporting Modules and Libraries
from imblearn.over_sampling import SMOTE
import pickle
# Module to Suppress Warnings
from warnings import filterwarnings
filterwarnings('ignore')
# Loading the file and creating dataframe
teledata = pd.read_csv('TelcomCustomer-Churn.csv')
# Getting Shape and Size of data
T = teledata.shape
# Displaying the Dataset
print('\033[1mDataset consist:-\033[0m\n Number of Rows =',T[0],'\n Number of Columns =',T[1],'\n\n\033[1mDataset:-\033[0m')
display(teledata.head())
Dataset consist:- Number of Rows = 7043 Number of Columns = 21 Dataset:-
| customerID | gender | SeniorCitizen | Partner | Dependents | tenure | PhoneService | MultipleLines | InternetService | OnlineSecurity | ... | DeviceProtection | TechSupport | StreamingTV | StreamingMovies | Contract | PaperlessBilling | PaymentMethod | MonthlyCharges | TotalCharges | Churn | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 7590-VHVEG | Female | 0 | Yes | No | 1 | No | No phone service | DSL | No | ... | No | No | No | No | Month-to-month | Yes | Electronic check | 29.85 | 29.85 | No |
| 1 | 5575-GNVDE | Male | 0 | No | No | 34 | Yes | No | DSL | Yes | ... | Yes | No | No | No | One year | No | Mailed check | 56.95 | 1889.5 | No |
| 2 | 3668-QPYBK | Male | 0 | No | No | 2 | Yes | No | DSL | Yes | ... | No | No | No | No | Month-to-month | Yes | Mailed check | 53.85 | 108.15 | Yes |
| 3 | 7795-CFOCW | Male | 0 | No | No | 45 | No | No phone service | DSL | Yes | ... | Yes | Yes | No | No | One year | No | Bank transfer (automatic) | 42.30 | 1840.75 | No |
| 4 | 9237-HQITU | Female | 0 | No | No | 2 | Yes | No | Fiber optic | No | ... | No | No | No | No | Month-to-month | Yes | Electronic check | 70.70 | 151.65 | Yes |
5 rows × 21 columns
Since we have chosen the second approach, 'Data set for direct import using pandas', we have a single file that includes all the rows and columns. Hence there is no need to merge the data.
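If the data had instead arrived split across multiple files (for example one file of demographics and one of services, both keyed on customerID), a merge would be needed first. A minimal sketch with hypothetical toy frames, not files from this project:

```python
import pandas as pd

# Hypothetical split: demographics and services in separate tables, keyed on customerID
demographics = pd.DataFrame({'customerID': ['A1', 'B2'], 'gender': ['Female', 'Male']})
services = pd.DataFrame({'customerID': ['A1', 'B2'], 'PhoneService': ['No', 'Yes']})

# Inner join on the shared key yields one row per customer with all columns
merged = pd.merge(demographics, services, on='customerID', how='inner')
print(merged.shape)  # (2, 3)
```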
# Displaying the Final Dataset
print('\033[1mDataset consist:-\033[0m\n Number of Rows =',T[0],'\n Number of Columns =',T[1],
'\n\n\033[1mTranspose of Dataset:-\033[0m')
display(teledata.head().T)
Dataset consist:- Number of Rows = 7043 Number of Columns = 21 Transpose of Dataset:-
| 0 | 1 | 2 | 3 | 4 | |
|---|---|---|---|---|---|
| customerID | 7590-VHVEG | 5575-GNVDE | 3668-QPYBK | 7795-CFOCW | 9237-HQITU |
| gender | Female | Male | Male | Male | Female |
| SeniorCitizen | 0 | 0 | 0 | 0 | 0 |
| Partner | Yes | No | No | No | No |
| Dependents | No | No | No | No | No |
| tenure | 1 | 34 | 2 | 45 | 2 |
| PhoneService | No | Yes | Yes | No | Yes |
| MultipleLines | No phone service | No | No | No phone service | No |
| InternetService | DSL | DSL | DSL | DSL | Fiber optic |
| OnlineSecurity | No | Yes | Yes | Yes | No |
| OnlineBackup | Yes | No | Yes | No | No |
| DeviceProtection | No | Yes | No | Yes | No |
| TechSupport | No | No | No | Yes | No |
| StreamingTV | No | No | No | No | No |
| StreamingMovies | No | No | No | No | No |
| Contract | Month-to-month | One year | Month-to-month | One year | Month-to-month |
| PaperlessBilling | Yes | No | Yes | No | Yes |
| PaymentMethod | Electronic check | Mailed check | Mailed check | Bank transfer (automatic) | Electronic check |
| MonthlyCharges | 29.85 | 56.95 | 53.85 | 42.3 | 70.7 |
| TotalCharges | 29.85 | 1889.5 | 108.15 | 1840.75 | 151.65 |
| Churn | No | No | Yes | No | Yes |
Key Observations:-
A copy of this dataset will be used to automate the data cleansing process.
# Making a copy of original dataset for future use
copy_teledata = teledata.copy() # for Automation purpose
copy_teledata2 = teledata.copy() # for comparison purpose
# Checking for Null Values in the Attributes
print('\n\033[1mNull Values in the Features:-')
display(teledata.isnull().sum().to_frame('Null Values'))
Null Values in the Features:-
| Null Values | |
|---|---|
| customerID | 0 |
| gender | 0 |
| SeniorCitizen | 0 |
| Partner | 0 |
| Dependents | 0 |
| tenure | 0 |
| PhoneService | 0 |
| MultipleLines | 0 |
| InternetService | 0 |
| OnlineSecurity | 0 |
| OnlineBackup | 0 |
| DeviceProtection | 0 |
| TechSupport | 0 |
| StreamingTV | 0 |
| StreamingMovies | 0 |
| Contract | 0 |
| PaperlessBilling | 0 |
| PaymentMethod | 0 |
| MonthlyCharges | 0 |
| TotalCharges | 0 |
| Churn | 0 |
Key Observations:-
# Getting Quantitative Attributes
qt_data = teledata[['tenure','MonthlyCharges','TotalCharges']]
# Getting Data type of each Quantitative Attribute
QT = qt_data.dtypes.to_frame('Data Type')
QT.index.name = 'Quantitative Attributes'
# Displaying Data type of Quantitative Attributes
print('\n\033[1mData Types of Quantitative Attributes:-')
display(QT)
Data Types of Quantitative Attributes:-
| Data Type | |
|---|---|
| Quantitative Attributes | |
| tenure | int64 |
| MonthlyCharges | float64 |
| TotalCharges | object |
Key Observations:-
# Converting Datatype of TotalCharges to Numerical Float datatype
teledata['TotalCharges'] = pd.to_numeric(teledata['TotalCharges'],errors='coerce')
# Displaying Data type of Quantitative Attributes
print('\n\033[1mData Types of Quantitative Attributes:-')
qt_data = teledata[['tenure','MonthlyCharges','TotalCharges']]
QT = qt_data.dtypes.to_frame('Data Type')
QT.index.name = 'Quantitative Attributes'
display(QT)
Data Types of Quantitative Attributes:-
| Data Type | |
|---|---|
| Quantitative Attributes | |
| tenure | int64 |
| MonthlyCharges | float64 |
| TotalCharges | float64 |
Key Observations:-
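The effect of `errors='coerce'` is easy to see on a toy Series mimicking TotalCharges: entries that cannot be parsed as numbers (such as the blank strings hiding in that column) become NaN instead of raising an error.

```python
import pandas as pd

# Toy Series mimicking TotalCharges: numbers stored as strings, plus a blank entry
s = pd.Series(['29.85', '1889.5', ' '])

converted = pd.to_numeric(s, errors='coerce')
print(converted.dtype)         # float64
print(converted.isna().sum())  # 1 -- the blank string became NaN
```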
# Checking for Null Values in the Attributes
print('\n\033[1mNull Values in the Features:-')
display(teledata.isnull().sum().to_frame('Null Values')[19:20])
Null Values in the Features:-
| Null Values | |
|---|---|
| TotalCharges | 11 |
Key Observations:-
# Dropping Null Values
teledata.dropna(inplace=True)
# Checking for Null Values After Dropping
print('\n\033[1mNull Values in the Features:-')
display(teledata.isnull().sum().to_frame('Null Values')[19:20])
# Displaying Shape and Size of Dataset After Dropping
T = teledata.shape
print('\033[1m\n\nDataset After Dropping Values consist:-\033[0m\n Number of Rows =',T[0],'\n Number of Columns =',T[1])
Null Values in the Features:-
| Null Values | |
|---|---|
| TotalCharges | 0 |
Dataset After Dropping Values consist:-
Number of Rows = 7032
Number of Columns = 21
Key Observations:-
# Getting Categorical Attributes
cat_data = teledata.drop(columns=['customerID','tenure','MonthlyCharges','TotalCharges'])
# Getting Unique Values in Categorical attributes
CD = cat_data.apply(lambda col: col.unique()).to_frame('Unique Values')
CD['Total Unique Values'] = cat_data.apply(lambda col: col.nunique())
CD['Data Type'] = cat_data.dtypes
# Displaying Unique Values of Categorical attributes
CD.index.name = 'Categorical Attributes'
print('\n\033[1mTable showing Data Types and Unique values of Categorical Attributes:-')
display(CD)
Table showing Data Types and Unique values of Categorical Attributes:-
| Unique Values | Total Unique Values | Data Type | |
|---|---|---|---|
| Categorical Attributes | |||
| gender | [Female, Male] | 2 | object |
| SeniorCitizen | [0, 1] | 2 | int64 |
| Partner | [Yes, No] | 2 | object |
| Dependents | [No, Yes] | 2 | object |
| PhoneService | [No, Yes] | 2 | object |
| MultipleLines | [No phone service, No, Yes] | 3 | object |
| InternetService | [DSL, Fiber optic, No] | 3 | object |
| OnlineSecurity | [No, Yes, No internet service] | 3 | object |
| OnlineBackup | [Yes, No, No internet service] | 3 | object |
| DeviceProtection | [No, Yes, No internet service] | 3 | object |
| TechSupport | [No, Yes, No internet service] | 3 | object |
| StreamingTV | [No, Yes, No internet service] | 3 | object |
| StreamingMovies | [No, Yes, No internet service] | 3 | object |
| Contract | [Month-to-month, One year, Two year] | 3 | object |
| PaperlessBilling | [Yes, No] | 2 | object |
| PaymentMethod | [Electronic check, Mailed check, Bank transfer... | 4 | object |
| Churn | [No, Yes] | 2 | object |
Key Observations:-
In the attribute 'MultipleLines' we have three unique values; here we need to change the value "No phone service" to "No".
Similarly, in the internet-dependent attributes (OnlineSecurity, OnlineBackup, DeviceProtection, TechSupport, StreamingTV and StreamingMovies) we have three unique values; here we need to change the value "No internet service" to "No".
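Before collapsing these values, a quick sanity check on toy data (illustrative values only, same pattern as the real columns) confirms that "No phone service" carries no information beyond PhoneService already being "No", so the replacement is lossless:

```python
import pandas as pd

# Toy frame with the same pattern as the real data (illustrative values only)
df = pd.DataFrame({'PhoneService': ['No', 'Yes', 'Yes'],
                   'MultipleLines': ['No phone service', 'No', 'Yes']})

# 'No phone service' should occur only on rows where PhoneService is 'No'
mask = df['MultipleLines'] == 'No phone service'
assert (df.loc[mask, 'PhoneService'] == 'No').all()

# Hence collapsing it to 'No' is safe
df['MultipleLines'] = df['MultipleLines'].replace({'No phone service': 'No'})
print(sorted(df['MultipleLines'].unique()))  # ['No', 'Yes']
```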
# Creating Replacing data
Replace = {'gender':{'Female':0,'Male':1},
'InternetService':{'No':0,'DSL':1,'Fiber optic':2},
'Contract':{'Month-to-month':0,'One year':1,'Two year':2},
'PaymentMethod':{'Electronic check':0,'Mailed check':1,'Bank transfer (automatic)':2,'Credit card (automatic)':3},
'Partner':{'No':0,'Yes':1},
'Dependents':{'No':0,'Yes':1},
'PhoneService':{'No':0,'Yes':1},
'PaperlessBilling':{'No':0,'Yes':1},
'Churn':{'No':0,'Yes':1},
'MultipleLines':{'No':0,'Yes':1,'No phone service':0},
'OnlineSecurity':{'No':0,'Yes':1,'No internet service':0},
'OnlineBackup':{'No':0,'Yes':1,'No internet service':0},
'DeviceProtection':{'No':0,'Yes':1,'No internet service':0},
'TechSupport':{'No':0,'Yes':1,'No internet service':0},
'StreamingTV':{'No':0,'Yes':1,'No internet service':0},
'StreamingMovies':{'No':0,'Yes':1,'No internet service':0},
}
# Performing Replace operation to convert Categorical Attributes to Continuous form
teledata = teledata.replace(Replace)
# Displaying Unique values along with datatypes of Categorical Attributes after converting them to Continuous form
print('\n\033[1mTable showing Data Types and Unique values of Categorical Attributes after Converting them to Continuous:-')
cat_data = teledata.drop(columns=['customerID','tenure','MonthlyCharges','TotalCharges'])
CD = cat_data.apply(lambda col: col.unique()).to_frame('Unique Values')
CD['Total Unique Values'] = cat_data.apply(lambda col: col.nunique())
CD['Data Type'] = cat_data.dtypes
CD.index.name = 'Categorical Attributes'
display(CD)
Table showing Data Types and Unique values of Categorical Attributes after Converting them to Continuous:-
| Unique Values | Total Unique Values | Data Type | |
|---|---|---|---|
| Categorical Attributes | |||
| gender | [0, 1] | 2 | int64 |
| SeniorCitizen | [0, 1] | 2 | int64 |
| Partner | [1, 0] | 2 | int64 |
| Dependents | [0, 1] | 2 | int64 |
| PhoneService | [0, 1] | 2 | int64 |
| MultipleLines | [0, 1] | 2 | int64 |
| InternetService | [1, 2, 0] | 3 | int64 |
| OnlineSecurity | [0, 1] | 2 | int64 |
| OnlineBackup | [1, 0] | 2 | int64 |
| DeviceProtection | [0, 1] | 2 | int64 |
| TechSupport | [0, 1] | 2 | int64 |
| StreamingTV | [0, 1] | 2 | int64 |
| StreamingMovies | [0, 1] | 2 | int64 |
| Contract | [0, 1, 2] | 3 | int64 |
| PaperlessBilling | [1, 0] | 2 | int64 |
| PaymentMethod | [0, 1, 2, 3] | 4 | int64 |
| Churn | [0, 1] | 2 | int64 |
Key Observations:-
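The explicit mapping dictionary keeps the integer codes under our control. An alternative, used in the automation cell further below, is pandas category codes, which are assigned in alphabetical order of the labels; this is why automated codes can differ from a hand-written mapping. A small sketch:

```python
import pandas as pd

# Category codes are assigned in alphabetical order of the labels,
# so they may not match a hand-written mapping
s = pd.Series(['Electronic check', 'Mailed check', 'Bank transfer (automatic)'])
codes = s.astype('category').cat.codes
print(list(codes))  # [1, 2, 0] -- 'Bank transfer (automatic)' sorts first
```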
For further analysis, we first differentiate between the different types of attributes: numeric attributes and continuous attributes.
Since customerID do not give much information and is not helpful for our further process, we are dropping this attribute.
# Dropping customerID Attribute
teledata.drop(['customerID'],axis=1,inplace=True)
# Getting Shape and Size of dataset
T = teledata.shape
# Displaying the Dataset after dropping customerID Attribute
print('\033[1mDataset consist:-\033[0m\n Number of Rows =',T[0],'\n Number of Columns =',T[1],'\n\n\033[1mDataset:-\033[0m')
display(teledata.head())
Dataset consist:- Number of Rows = 7032 Number of Columns = 20 Dataset:-
| gender | SeniorCitizen | Partner | Dependents | tenure | PhoneService | MultipleLines | InternetService | OnlineSecurity | OnlineBackup | DeviceProtection | TechSupport | StreamingTV | StreamingMovies | Contract | PaperlessBilling | PaymentMethod | MonthlyCharges | TotalCharges | Churn | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 29.85 | 29.85 | 0 |
| 1 | 1 | 0 | 0 | 0 | 34 | 1 | 0 | 1 | 1 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 1 | 56.95 | 1889.50 | 0 |
| 2 | 1 | 0 | 0 | 0 | 2 | 1 | 0 | 1 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 53.85 | 108.15 | 1 |
| 3 | 1 | 0 | 0 | 0 | 45 | 0 | 0 | 1 | 1 | 0 | 1 | 1 | 0 | 0 | 1 | 0 | 2 | 42.30 | 1840.75 | 0 |
| 4 | 0 | 0 | 0 | 0 | 2 | 1 | 0 | 2 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 70.70 | 151.65 | 1 |
Key Observations:-
# Automating Datatype Handling, Dropping Attributes and Converting Categorical Attributes to Continuous form
for i in copy_teledata.columns:
    if (copy_teledata[i].dtype=='O'):
        if i == 'TotalCharges':
            copy_teledata[i] = pd.to_numeric(copy_teledata[i],errors='coerce')
            copy_teledata.dropna(inplace=True)
        elif i == 'customerID':
            copy_teledata.drop([i],axis=1,inplace=True)
        elif (i=='gender')|(i=='InternetService')|(i=='Contract')|(i=='PaymentMethod'):
            copy_teledata[i] = copy_teledata[i].astype('category')
            copy_teledata[i] = np.int64(copy_teledata[i].cat.codes)
        else:
            copy_teledata[i] = np.int64(np.where(copy_teledata[i].str.contains("No"),0,1))
# Displaying copy_teledata after Automation
CT = copy_teledata.shape
print('\033[1mDataset After Automation consist:-\033[0m\n Number of Rows =',CT[0],'\n Number of Columns =',CT[1])
print('\n\033[1m\n\nDataset After Automation:-')
display(copy_teledata.head())
Dataset After Automation consist:- Number of Rows = 7032 Number of Columns = 20 Dataset After Automation:-
| gender | SeniorCitizen | Partner | Dependents | tenure | PhoneService | MultipleLines | InternetService | OnlineSecurity | OnlineBackup | DeviceProtection | TechSupport | StreamingTV | StreamingMovies | Contract | PaperlessBilling | PaymentMethod | MonthlyCharges | TotalCharges | Churn | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 2 | 29.85 | 29.85 | 0 |
| 1 | 1 | 0 | 0 | 0 | 34 | 1 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 3 | 56.95 | 1889.50 | 0 |
| 2 | 1 | 0 | 0 | 0 | 2 | 1 | 0 | 0 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 3 | 53.85 | 108.15 | 1 |
| 3 | 1 | 0 | 0 | 0 | 45 | 0 | 0 | 0 | 1 | 0 | 1 | 1 | 0 | 0 | 1 | 0 | 0 | 42.30 | 1840.75 | 0 |
| 4 | 0 | 0 | 0 | 0 | 2 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 2 | 70.70 | 151.65 | 1 |
Key Observations:-
# Detailed Comparison between Before Data Cleansing, Manual Data Cleansing and Automated Data Cleansing Procedures
print('\n\033[1mTables showing Comparison between Before, Manual and Automated Data Cleansing Procedures:-')
# For Quantitative Data
print('\n\033[1m\n1. Data Types and Total Rows of Quantitative Attributes:-')
Original = teledata[['tenure','MonthlyCharges','TotalCharges']]
Cpy1 = copy_teledata[['tenure','MonthlyCharges','TotalCharges']]
Cpy2 = copy_teledata2[['tenure','MonthlyCharges','TotalCharges']]
D1 = pd.DataFrame({'Quantitative Attributes':teledata[['tenure','MonthlyCharges','TotalCharges']].columns,
'Data Type': Cpy2.dtypes, 'Data Types': Original.dtypes, 'Data type':Cpy1.dtypes})
# DataFrame.append was removed in pandas 2.0; pd.concat appends the summary rows instead
D1 = pd.concat([D1, pd.DataFrame([
    {'Quantitative Attributes':'____________________','Data Type':'___________________',
     'Data Types':'___________________','Data type':'___________________'},
    {'Quantitative Attributes':'Total Rows','Data Type':Cpy2.shape[0],'Data Types':Original.shape[0],
     'Data type':Cpy1.shape[0]},
    {'Quantitative Attributes':'Total Columns','Data Type':Cpy2.shape[1],'Data Types':Original.shape[1],
     'Data type':Cpy1.shape[1]}])], ignore_index=True)
D1 = D1.set_index('Quantitative Attributes')
column1=[('Before Data cleansing','Data Type'),('Manual Data cleansing','Data Types'),('Automated Data cleansing','Data type')]
D1.columns = pd.MultiIndex.from_tuples(column1)
display(D1)
# For Categorical Data
print('\n\033[1m\n\n2. Unique Values and Data Types of Categorical Attributes:-')
Original = teledata.drop(columns=['tenure','MonthlyCharges','TotalCharges']).copy() #copy() used to avoid change of index name
Cpy1 = copy_teledata.drop(columns=['tenure','MonthlyCharges','TotalCharges'])
Cpy2 = copy_teledata2.drop(columns=['customerID','tenure','MonthlyCharges','TotalCharges'])
D2 = Cpy2.apply(lambda col: col.unique()).to_frame('Unique Value')
D2['Data Types'] = Cpy2.dtypes
D2['Unique Values'] = Original.apply(lambda col: col.unique()).to_frame()
D2['Data Type'] = Original.dtypes
D2['Unique values'] = Cpy1.apply(lambda col: col.unique()).to_frame()
D2['Data type'] = Cpy1.dtypes
D2.index.name = 'Categorical Attributes' #It will not affect original data since we used copy() earlier
column2=[('_______________Before Data cleansing_______________', 'Unique Value'),
('_______________Before Data cleansing_______________','Data Types'),('__Manual Data cleansing__','Unique Values'),
('__Manual Data cleansing__','Data Type'),('__Automated Data cleansing__','Unique values'),
('__Automated Data cleansing__','Data type')]
D2.columns = pd.MultiIndex.from_tuples(column2)
display(D2)
Tables showing Comparison between Before, Manual and Automated Data Cleansing Procedures:- 1. Data Types and Total Rows of Quantitative Attributes:-
| Before Data cleansing | Manual Data cleansing | Automated Data cleansing | |
|---|---|---|---|
| Data Type | Data Types | Data type | |
| Quantitative Attributes | |||
| tenure | int64 | int64 | int64 |
| MonthlyCharges | float64 | float64 | float64 |
| TotalCharges | object | float64 | float64 |
| ____________________ | ___________________ | ___________________ | ___________________ |
| Total Rows | 7043 | 7032 | 7032 |
| Total Columns | 3 | 3 | 3 |
2. Unique Values and Data Types of Categorical Attributes:-
| _______________Before Data cleansing_______________ | __Manual Data cleansing__ | __Automated Data cleansing__ | ||||
|---|---|---|---|---|---|---|
| Unique Value | Data Types | Unique Values | Data Type | Unique values | Data type | |
| Categorical Attributes | ||||||
| gender | [Female, Male] | object | [0, 1] | int64 | [0, 1] | int64 |
| SeniorCitizen | [0, 1] | int64 | [0, 1] | int64 | [0, 1] | int64 |
| Partner | [Yes, No] | object | [1, 0] | int64 | [1, 0] | int64 |
| Dependents | [No, Yes] | object | [0, 1] | int64 | [0, 1] | int64 |
| PhoneService | [No, Yes] | object | [0, 1] | int64 | [0, 1] | int64 |
| MultipleLines | [No phone service, No, Yes] | object | [0, 1] | int64 | [0, 1] | int64 |
| InternetService | [DSL, Fiber optic, No] | object | [1, 2, 0] | int64 | [0, 1, 2] | int64 |
| OnlineSecurity | [No, Yes, No internet service] | object | [0, 1] | int64 | [0, 1] | int64 |
| OnlineBackup | [Yes, No, No internet service] | object | [1, 0] | int64 | [1, 0] | int64 |
| DeviceProtection | [No, Yes, No internet service] | object | [0, 1] | int64 | [0, 1] | int64 |
| TechSupport | [No, Yes, No internet service] | object | [0, 1] | int64 | [0, 1] | int64 |
| StreamingTV | [No, Yes, No internet service] | object | [0, 1] | int64 | [0, 1] | int64 |
| StreamingMovies | [No, Yes, No internet service] | object | [0, 1] | int64 | [0, 1] | int64 |
| Contract | [Month-to-month, One year, Two year] | object | [0, 1, 2] | int64 | [0, 1, 2] | int64 |
| PaperlessBilling | [Yes, No] | object | [1, 0] | int64 | [1, 0] | int64 |
| PaymentMethod | [Electronic check, Mailed check, Bank transfer... | object | [0, 1, 2, 3] | int64 | [2, 3, 0, 1] | int64 |
| Churn | [No, Yes] | object | [0, 1] | int64 | [0, 1] | int64 |
Key Observations:-
# Describing the data in terms of count, mean, standard deviation, and 5-point summary
print('\n\033[1mBrief Summary of Dataset:-')
display(teledata.describe()[1:].T)
Brief Summary of Dataset:-
| mean | std | min | 25% | 50% | 75% | max | |
|---|---|---|---|---|---|---|---|
| gender | 0.504693 | 0.500014 | 0.00 | 0.0000 | 1.000 | 1.0000 | 1.00 |
| SeniorCitizen | 0.162400 | 0.368844 | 0.00 | 0.0000 | 0.000 | 0.0000 | 1.00 |
| Partner | 0.482509 | 0.499729 | 0.00 | 0.0000 | 0.000 | 1.0000 | 1.00 |
| Dependents | 0.298493 | 0.457629 | 0.00 | 0.0000 | 0.000 | 1.0000 | 1.00 |
| tenure | 32.421786 | 24.545260 | 1.00 | 9.0000 | 29.000 | 55.0000 | 72.00 |
| PhoneService | 0.903299 | 0.295571 | 0.00 | 1.0000 | 1.000 | 1.0000 | 1.00 |
| MultipleLines | 0.421928 | 0.493902 | 0.00 | 0.0000 | 0.000 | 1.0000 | 1.00 |
| InternetService | 1.224118 | 0.778643 | 0.00 | 1.0000 | 1.000 | 2.0000 | 2.00 |
| OnlineSecurity | 0.286547 | 0.452180 | 0.00 | 0.0000 | 0.000 | 1.0000 | 1.00 |
| OnlineBackup | 0.344852 | 0.475354 | 0.00 | 0.0000 | 0.000 | 1.0000 | 1.00 |
| DeviceProtection | 0.343857 | 0.475028 | 0.00 | 0.0000 | 0.000 | 1.0000 | 1.00 |
| TechSupport | 0.290102 | 0.453842 | 0.00 | 0.0000 | 0.000 | 1.0000 | 1.00 |
| StreamingTV | 0.384386 | 0.486484 | 0.00 | 0.0000 | 0.000 | 1.0000 | 1.00 |
| StreamingMovies | 0.388367 | 0.487414 | 0.00 | 0.0000 | 0.000 | 1.0000 | 1.00 |
| Contract | 0.688567 | 0.832934 | 0.00 | 0.0000 | 0.000 | 1.0000 | 2.00 |
| PaperlessBilling | 0.592719 | 0.491363 | 0.00 | 0.0000 | 1.000 | 1.0000 | 1.00 |
| PaymentMethod | 1.315557 | 1.149523 | 0.00 | 0.0000 | 1.000 | 2.0000 | 3.00 |
| MonthlyCharges | 64.798208 | 30.085974 | 18.25 | 35.5875 | 70.350 | 89.8625 | 118.75 |
| TotalCharges | 2283.300441 | 2266.771362 | 18.80 | 401.4500 | 1397.475 | 3794.7375 | 8684.80 |
| Churn | 0.265785 | 0.441782 | 0.00 | 0.0000 | 0.000 | 1.0000 | 1.00 |
# Checking skewness of the data attributes
print('\033[1m\nSkewness of all attributes:-')
display(teledata.skew().to_frame(name='Skewness'))
Skewness of all attributes:-
| Skewness | |
|---|---|
| gender | -0.018776 |
| SeniorCitizen | 1.831103 |
| Partner | 0.070024 |
| Dependents | 0.880908 |
| tenure | 0.237731 |
| PhoneService | -2.729727 |
| MultipleLines | 0.316232 |
| InternetService | -0.412648 |
| OnlineSecurity | 0.944373 |
| OnlineBackup | 0.652954 |
| DeviceProtection | 0.657594 |
| TechSupport | 0.925245 |
| StreamingTV | 0.475441 |
| StreamingMovies | 0.458191 |
| Contract | 0.635149 |
| PaperlessBilling | -0.377503 |
| PaymentMethod | 0.218108 |
| MonthlyCharges | -0.222103 |
| TotalCharges | 0.961642 |
| Churn | 1.060622 |
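As a side note, a strongly right-skewed attribute such as TotalCharges could be pulled toward symmetry with a log transform before modelling. A sketch on synthetic data (this transform is not part of this notebook's pipeline):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
# Synthetic right-skewed values, similar in spirit to TotalCharges
x = pd.Series(rng.exponential(scale=2000.0, size=5000))

print(round(x.skew(), 2))            # strongly positive skew
print(round(np.log1p(x).skew(), 2))  # noticeably closer to symmetric
```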
# Checking Variance of all data attributes
print('\033[1m\nVariance of all attributes:-')
display(teledata.var().to_frame(name='Variance'))
Variance of all attributes:-
| Variance | |
|---|---|
| gender | 2.500135e-01 |
| SeniorCitizen | 1.360459e-01 |
| Partner | 2.497296e-01 |
| Dependents | 2.094246e-01 |
| tenure | 6.024698e+02 |
| PhoneService | 8.736218e-02 |
| MultipleLines | 2.439395e-01 |
| InternetService | 6.062850e-01 |
| OnlineSecurity | 2.044670e-01 |
| OnlineBackup | 2.259613e-01 |
| DeviceProtection | 2.256513e-01 |
| TechSupport | 2.059723e-01 |
| StreamingTV | 2.366670e-01 |
| StreamingMovies | 2.375720e-01 |
| Contract | 6.937791e-01 |
| PaperlessBilling | 2.414375e-01 |
| PaymentMethod | 1.321402e+00 |
| MonthlyCharges | 9.051658e+02 |
| TotalCharges | 5.138252e+06 |
| Churn | 1.951711e-01 |
# Checking Covariance between all attributes
print('\033[1mCovariance between all attributes:-')
display(teledata.cov())
Covariance between all attributes:-
| gender | SeniorCitizen | Partner | Dependents | tenure | PhoneService | MultipleLines | InternetService | OnlineSecurity | OnlineBackup | DeviceProtection | TechSupport | StreamingTV | StreamingMovies | Contract | PaperlessBilling | PaymentMethod | MonthlyCharges | TotalCharges | Churn | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| gender | 0.250014 | -0.000336 | -0.000345 | 0.002368 | 0.064867 | -0.001111 | -0.002194 | -0.003754 | -0.003692 | -0.003112 | -0.000192 | -0.001931 | -0.001733 | -0.002463 | 0.000039 | -0.002924 | -0.002832 | -0.207288 | 5.422208e-02 | -0.001887 |
| SeniorCitizen | -0.000336 | 0.136046 | 0.003125 | -0.035540 | 0.141988 | 0.000915 | 0.026050 | 0.074393 | -0.006434 | 0.011688 | 0.010427 | -0.010140 | 0.018921 | 0.021545 | -0.043570 | 0.028320 | -0.039734 | 2.439951 | 8.562397e+01 | 0.024530 |
| Partner | -0.000345 | 0.003125 | 0.249730 | 0.103430 | 4.684523 | 0.002717 | 0.035187 | 0.000365 | 0.032392 | 0.033696 | 0.036452 | 0.027262 | 0.030263 | 0.028768 | 0.122414 | -0.003427 | 0.076563 | 1.470784 | 3.614364e+02 | -0.033112 |
| Dependents | 0.002368 | -0.035540 | 0.103430 | 0.209425 | 1.835254 | -0.000146 | -0.005494 | -0.063351 | 0.016717 | 0.005142 | 0.003022 | 0.013096 | -0.003673 | -0.008560 | 0.091694 | -0.024764 | 0.065232 | -1.546763 | 6.706746e+01 | -0.032980 |
| tenure | 0.064867 | 0.141988 | 4.684523 | 1.835254 | 602.469774 | 0.057149 | 4.029663 | 0.597677 | 3.643735 | 4.213649 | 4.215207 | 3.623604 | 3.346595 | 3.414465 | 13.835544 | 0.058170 | 9.619692 | 182.299526 | 4.595074e+04 | -3.839186 |
| PhoneService | -0.001111 | 0.000915 | 0.002717 | -0.000146 | 0.057149 | 0.087362 | 0.040807 | 0.021676 | -0.012253 | -0.007325 | -0.009839 | -0.012762 | -0.003075 | -0.004823 | 0.000743 | 0.002425 | -0.001055 | 2.205644 | 7.571460e+01 | 0.001527 |
| MultipleLines | -0.002194 | 0.026050 | 0.035187 | -0.005494 | 4.029663 | 0.040807 | 0.243940 | 0.132704 | 0.022019 | 0.047479 | 0.047330 | 0.022510 | 0.061944 | 0.062397 | 0.044236 | 0.039739 | 0.020444 | 7.294726 | 5.251225e+02 | 0.008735 |
| InternetService | -0.003754 | 0.074393 | 0.000365 | -0.063351 | 0.597677 | 0.021676 | 0.132704 | 0.606285 | 0.055099 | 0.113713 | 0.115927 | 0.058142 | 0.162738 | 0.161987 | -0.187339 | 0.144485 | -0.159909 | 21.209853 | 7.557967e+02 | 0.108821 |
| OnlineSecurity | -0.003692 | -0.006434 | 0.032392 | 0.016717 | 3.643735 | -0.012253 | 0.022019 | 0.055099 | 0.204467 | 0.060891 | 0.059043 | 0.072741 | 0.038609 | 0.041308 | 0.092524 | -0.000900 | 0.084647 | 4.032948 | 4.229298e+02 | -0.034214 |
| OnlineBackup | -0.003112 | 0.011688 | 0.033696 | 0.005142 | 4.213649 | -0.007325 | 0.047479 | 0.113713 | 0.060891 | 0.225961 | 0.068432 | 0.063362 | 0.065121 | 0.063605 | 0.061474 | 0.029677 | 0.052592 | 6.314521 | 5.496425e+02 | -0.017285 |
| DeviceProtection | -0.000192 | 0.010427 | 0.036452 | 0.003022 | 4.215207 | -0.009839 | 0.047330 | 0.115927 | 0.059043 | 0.068432 | 0.225651 | 0.071758 | 0.090109 | 0.093149 | 0.086907 | 0.024293 | 0.060586 | 6.897260 | 5.630279e+02 | -0.013891 |
| TechSupport | -0.001931 | -0.010140 | 0.027262 | 0.013096 | 3.623604 | -0.012762 | 0.022510 | 0.058142 | 0.072741 | 0.063362 | 0.071758 | 0.205972 | 0.061279 | 0.061973 | 0.111126 | 0.008371 | 0.087223 | 4.619258 | 4.453157e+02 | -0.033025 |
| StreamingTV | -0.001733 | 0.018921 | 0.030263 | -0.003673 | 3.346595 | -0.003075 | 0.061944 | 0.162738 | 0.038609 | 0.065121 | 0.090109 | 0.061279 | 0.236667 | 0.126475 | 0.042214 | 0.053603 | -0.007958 | 9.216042 | 5.686975e+02 | 0.013595 |
| StreamingMovies | -0.002463 | 0.021545 | 0.028768 | -0.008560 | 3.414465 | -0.004823 | 0.062397 | 0.161987 | 0.041308 | 0.063605 | 0.093149 | 0.061973 | 0.126475 | 0.237572 | 0.044307 | 0.050673 | -0.002388 | 9.197965 | 5.743772e+02 | 0.013105 |
| Contract | 0.000039 | -0.043570 | 0.122414 | 0.091694 | 13.835544 | 0.000743 | 0.044236 | -0.187339 | 0.092524 | 0.061474 | 0.086907 | 0.111126 | 0.042214 | 0.044307 | 0.693779 | -0.071817 | 0.344200 | -1.822802 | 8.502091e+02 | -0.145773 |
| PaperlessBilling | -0.002924 | 0.028320 | -0.003427 | -0.024764 | 0.058170 | 0.002425 | 0.039739 | 0.144485 | -0.000900 | 0.029677 | 0.024293 | 0.008371 | 0.053603 | 0.050673 | -0.071817 | 0.241438 | -0.057494 | 5.202634 | 1.757920e+02 | 0.041560 |
| PaymentMethod | -0.002832 | -0.039734 | 0.076563 | 0.065232 | 9.619692 | -0.001055 | 0.020444 | -0.159909 | 0.084647 | 0.052592 | 0.060586 | 0.087223 | -0.007958 | -0.002388 | 0.344200 | -0.057494 | 1.321402 | -2.581372 | 5.802759e+02 | -0.133520 |
| MonthlyCharges | -0.207288 | 2.439951 | 1.470784 | -1.546763 | 182.299526 | 2.205644 | 7.294726 | 21.209853 | 4.032948 | 6.314521 | 6.897260 | 4.619258 | 9.216042 | 9.197965 | -1.822802 | 5.202634 | -2.581372 | 905.165825 | 4.440133e+04 | 2.563362 |
| TotalCharges | 0.054222 | 85.623972 | 361.436397 | 67.067462 | 45950.743236 | 75.714600 | 525.122521 | 755.796694 | 422.929805 | 549.642473 | 563.027938 | 445.315652 | 568.697512 | 574.377172 | 850.209098 | 175.791980 | 580.275874 | 44401.333073 | 5.138252e+06 | -199.766978 |
| Churn | -0.001887 | 0.024530 | -0.033112 | -0.032980 | -3.839186 | 0.001527 | 0.008735 | 0.108821 | -0.034214 | -0.017285 | -0.013891 | -0.033025 | 0.013595 | 0.013105 | -0.145773 | 0.041560 | -0.133520 | 2.563362 | -1.997670e+02 | 0.195171 |
# Checking Correlation by plotting Heatmap for all attributes
print('\033[1mHeatmap showing Correlation of Data attributes:-')
plt.figure(figsize=(22,18))
plt.title('Correlation of Data Attributes\n')
sns.heatmap(teledata.corr(),annot=True,fmt= '.2f',cmap='magma');
plt.show()
Heatmap showing Correlation of Data attributes:-
Key Observations:-
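Beyond reading the heatmap visually, the attributes most associated with Churn can be ranked directly. A sketch of the idea on a small toy frame (with the real data, the same two lines would be applied to `teledata`):

```python
import pandas as pd

# Toy numeric frame standing in for the encoded dataset (illustrative values only)
df = pd.DataFrame({'tenure':   [1, 34, 2, 45, 2, 8],
                   'Contract': [0, 1, 0, 1, 0, 0],
                   'Churn':    [0, 0, 1, 0, 1, 1]})

# Correlation of every attribute with the target, strongest (absolute) first
corr = df.corr()['Churn'].drop('Churn')
ranked = corr.reindex(corr.abs().sort_values(ascending=False).index)
print(ranked)  # both toy attributes correlate negatively with churn
```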
# Getting Percentage of Loyal Customer
print('\033[1mPercentage of Loyal Customer =',round(teledata[teledata['Churn']==0].shape[0]/teledata.shape[0]*100,2),'%')
Percentage of Loyal Customer = 73.42 %
# Getting Highest Loyalty (Tenure) of Customers
print('\033[1mHighest Tenure:-\n > Loyal Customer =',teledata[teledata['Churn']==0]['tenure'].max())
print('\033[1m > Not so Loyal Customer =',teledata[teledata['Churn']==1]['tenure'].max())
Highest Tenure:- > Loyal Customer = 72 > Not so Loyal Customer = 72
# Getting Lowest Loyalty (Tenure) of Customers
print('\033[1mLowest Tenure:-\n > Loyal Customer =',teledata[teledata['Churn']==0]['tenure'].min())
print('\033[1m > Not so Loyal Customer =',teledata[teledata['Churn']==1]['tenure'].min())
Lowest Tenure:- > Loyal Customer = 1 > Not so Loyal Customer = 1
# Getting Average Loyalty (Tenure) of Customers
print('\033[1mAverage Tenure:-\n > Loyal Customer =',round(teledata[teledata['Churn']==0]['tenure'].mean(),2))
print('\033[1m > Not so Loyal Customer =',round(teledata[teledata['Churn']==1]['tenure'].mean(),2))
print('\033[1m _____________________ ____')
# Getting Difference in Averages
print('\033[1m Difference =',round(round(teledata[teledata['Churn']==0]['tenure'].mean(),2)-
round(teledata[teledata['Churn']==1]['tenure'].mean(),2),2))
Average Tenure:- > Loyal Customer = 37.65 > Not so Loyal Customer = 17.98 _____________________ ____ Difference = 19.67
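The per-group tenure statistics computed cell by cell above can also be obtained in a single `groupby` call. A minimal sketch on toy data (since `teledata` is built earlier in the notebook):

```python
import pandas as pd

# Toy stand-in for teledata with just the two relevant columns
toy = pd.DataFrame({'Churn':  [0, 0, 0, 1, 1],
                    'tenure': [10, 40, 70, 5, 25]})
# One row per Churn class, one column per summary statistic
stats = toy.groupby('Churn')['tenure'].agg(['min', 'max', 'mean'])
print(stats)
```

This replaces four separate filtered `print` cells with one table.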
# Visualizing Relation between Tenure and Churn
print(f'\033[1mPlot Showing Relation between Tenure and Churn:-')
plt.figure(figsize=(10,8))
plt.title(f'Relation between Tenure and Churn\n')
sns.histplot(data=teledata, x = 'tenure', hue = 'Churn', palette='bright');
plt.show()
Plot Showing Relation between Tenure and Churn:-
Key Observations:-
Here we examine whether churn behaviour differs by gender.
# Getting Customers who do not churn
G = teledata[teledata['Churn']==0]
print('\033[1mCustomers who do not churn:-\n\n Females =',G[G['gender']==0].shape[0],'\n Males =',G[G['gender']==1].shape[0])
print('\033[1m _______ ____\n Total =',G.shape[0])
Customers who do not churn:- Females = 2544 Males = 2619 _______ ____ Total = 5163
# Getting Customers who churn
G = teledata[teledata['Churn']==1]
print('\033[1mCustomers who churn:-\n\n Females =',G[G['gender']==0].shape[0],'\n Males =',G[G['gender']==1].shape[0])
print('\033[1m _______ ____\n Total =',G.shape[0])
Customers who churn:- Females = 939 Males = 930 _______ ____ Total = 1869
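Counts like these for every gender/churn combination can be produced at once with `pd.crosstab`. A sketch on toy data (column names mirror the dataset):

```python
import pandas as pd

toy = pd.DataFrame({'gender': [0, 0, 1, 1, 1],
                    'Churn':  [0, 1, 0, 0, 1]})
# Rows = gender, columns = Churn, margins=True adds row/column totals
table = pd.crosstab(toy['gender'], toy['Churn'], margins=True)
print(table)
```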
# Visualizing Relation between Gender and Churn
print(f'\033[1mPlot Showing Relation between Gender and Churn:-')
plt.figure(figsize=(10,8))
plt.title(f'Relation between Gender and Churn\n')
sns.countplot(data=teledata, x = 'gender', hue = 'Churn', palette='bright');
plt.show()
Plot Showing Relation between Gender and Churn:-
Key Observations:-
# Getting Customers who do not churn
G = teledata[teledata['Churn']==0]
print('\033[1mCustomers who do not churn:-\n\n Not Senior Citizen =',G[G['SeniorCitizen']==0].shape[0],
'\n Senior Citizen =',G[G['SeniorCitizen']==1].shape[0])
print('\033[1m __________________ ____\n Total =',G.shape[0])
Customers who do not churn:- Not Senior Citizen = 4497 Senior Citizen = 666 __________________ ____ Total = 5163
# Getting Customers who churn
G = teledata[teledata['Churn']==1]
print('\033[1mCustomers who churn:-\n\n Not Senior Citizen =',G[G['SeniorCitizen']==0].shape[0],
'\n Senior Citizen =',G[G['SeniorCitizen']==1].shape[0])
print('\033[1m __________________ ____\n Total =',G.shape[0])
Customers who churn:- Not Senior Citizen = 1393 Senior Citizen = 476 __________________ ____ Total = 1869
G = teledata[teledata['Churn']==1]
G = G[G['SeniorCitizen']==1]
print('\033[1mSenior Citizens who churn:-\n\n Females =',G[G['gender']==0].shape[0],'\n Males =',G[G['gender']==1].shape[0])
print('\033[1m _______ ____\n Total =',G.shape[0])
Senior Citizens who churn:- Females = 240 Males = 236 _______ ____ Total = 476
# Visualizing Relation between SeniorCitizen and Churn
print(f'\033[1mPlot Showing Relation between SeniorCitizen and Churn:-')
plt.figure(figsize=(10,8))
plt.title(f'Relation between SeniorCitizen and Churn\n')
sns.countplot(data=teledata, x = 'SeniorCitizen', hue = 'Churn', palette='bright');
plt.show()
Plot Showing Relation between SeniorCitizen and Churn:-
Key Observations:-
Univariate analysis is the simplest form of analyzing data. It involves only one variable.
We will use the following functions for easy analysis of individual attributes.
# Creating Plot function for Quantitative Attributes
def qt_data(x):
    # Plotting Distribution for Quantitative attribute
    print(f'\033[1mPlot Showing Distribution of Feature "{x}":-')
    plt.figure(figsize=(12,6))
    plt.title(f'Distribution of "{x}"\n')
    sns.histplot(teledata[x], kde=True, color='#9400D3')  # distplot is deprecated in recent seaborn
    print('')
    plt.show()
    print('\n__________________________________________________________________________________________________\n')
    print('')
    # Box plot for Quantitative data
    print(f'\033[1mPlot Showing 5 point summary with outliers of Attribute "{x}":-\n')
    plt.figure(figsize=(12,6))
    plt.title(f'Box Plot for "{x}"\n')
    sns.boxplot(x=teledata[x], color="#9400D3")  # pass data by keyword (positional use is deprecated)
    plt.show()
# Creating Plot function for Categorical Attributes
def cat_data(x):
    # Plotting Frequency Distribution of categorical attribute
    colors = ['gold','tomato','yellowgreen','#ADD8E6']
    print(f'\033[1mPlot Showing Frequency Distribution of Attribute "{x}":-')
    plt.figure(figsize=(10,8))
    plt.title(f'Frequencies of "{x}" Attribute\n')
    sns.countplot(x=teledata[x], palette='bright')  # pass data by keyword (positional use is deprecated)
    plt.show()
    print('\n___________________________________________________________________________________')
    print('')
    # Plotting Pie Chart to check contribution of categorical attribute
    print(f'\033[1m\nPie Chart Showing Contribution of Each Category of "{x}" feature:-\n')
    plt.title(f'Contribution of Each Category of "{x}" Attribute\n\n\n\n\n\n')
    teledata[x].value_counts().plot.pie(radius=2.5,shadow=True,autopct='%1.1f%%',colors=colors)
    plt.legend(loc='right',prop={'size': 12}, bbox_to_anchor=(2, 1))
    plt.show()
# Univariate analysis for Gender Attribute
cat_data('gender')
Plot Showing Frequency Distribution of Attribute "gender":-
___________________________________________________________________________________
Pie Chart Showing Contribution of Each Category of "gender" feature:-
Key Observations:-
# Univariate analysis for SeniorCitizen Attribute
cat_data('SeniorCitizen')
Plot Showing Frequency Distribution of Attribute "SeniorCitizen":-
___________________________________________________________________________________
Pie Chart Showing Contribution of Each Category of "SeniorCitizen" feature:-
Key Observations:-
# Univariate analysis for Partner Attribute
cat_data('Partner')
Plot Showing Frequency Distribution of Attribute "Partner":-
___________________________________________________________________________________
Pie Chart Showing Contribution of Each Category of "Partner" feature:-
Key Observations:-
# Univariate analysis for Dependents Attribute
cat_data('Dependents')
Plot Showing Frequency Distribution of Attribute "Dependents":-
___________________________________________________________________________________
Pie Chart Showing Contribution of Each Category of "Dependents" feature:-
Key Observations:-
# Univariate analysis for Tenure Attribute
qt_data('tenure')
Plot Showing Distribution of Feature "tenure":-
__________________________________________________________________________________________________
Plot Showing 5 point summary with outliers of Attribute "tenure":-
Key Observations:-
# Univariate analysis for PhoneService Attribute
cat_data('PhoneService')
Plot Showing Frequency Distribution of Attribute "PhoneService":-
___________________________________________________________________________________
Pie Chart Showing Contribution of Each Category of "PhoneService" feature:-
Key Observations:-
# Univariate analysis for MultipleLines Attribute
cat_data('MultipleLines')
Plot Showing Frequency Distribution of Attribute "MultipleLines":-
___________________________________________________________________________________
Pie Chart Showing Contribution of Each Category of "MultipleLines" feature:-
Key Observations:-
# Univariate analysis for InternetService Attribute
cat_data('InternetService')
Plot Showing Frequency Distribution of Attribute "InternetService":-
___________________________________________________________________________________
Pie Chart Showing Contribution of Each Category of "InternetService" feature:-
Key Observations:-
# Univariate analysis for OnlineSecurity Attribute
cat_data('OnlineSecurity')
Plot Showing Frequency Distribution of Attribute "OnlineSecurity":-
___________________________________________________________________________________
Pie Chart Showing Contribution of Each Category of "OnlineSecurity" feature:-
Key Observations:-
# Univariate analysis for OnlineBackup Attribute
cat_data('OnlineBackup')
Plot Showing Frequency Distribution of Attribute "OnlineBackup":-
___________________________________________________________________________________
Pie Chart Showing Contribution of Each Category of "OnlineBackup" feature:-
Key Observations:-
# Univariate analysis for DeviceProtection Attribute
cat_data('DeviceProtection')
Plot Showing Frequency Distribution of Attribute "DeviceProtection":-
___________________________________________________________________________________
Pie Chart Showing Contribution of Each Category of "DeviceProtection" feature:-
Key Observations:-
# Univariate analysis for TechSupport Attribute
cat_data('TechSupport')
Plot Showing Frequency Distribution of Attribute "TechSupport":-
___________________________________________________________________________________
Pie Chart Showing Contribution of Each Category of "TechSupport" feature:-
Key Observations:-
# Univariate analysis for StreamingTV Attribute
cat_data('StreamingTV')
Plot Showing Frequency Distribution of Attribute "StreamingTV":-
___________________________________________________________________________________
Pie Chart Showing Contribution of Each Category of "StreamingTV" feature:-
Key Observations:-
# Univariate analysis for StreamingMovies Attribute
cat_data('StreamingMovies')
Plot Showing Frequency Distribution of Attribute "StreamingMovies":-
___________________________________________________________________________________
Pie Chart Showing Contribution of Each Category of "StreamingMovies" feature:-
Key Observations:-
# Univariate analysis for Contract Attribute
cat_data('Contract')
Plot Showing Frequency Distribution of Attribute "Contract":-
___________________________________________________________________________________
Pie Chart Showing Contribution of Each Category of "Contract" feature:-
Key Observations:-
# Univariate analysis for PaperlessBilling Attribute
cat_data('PaperlessBilling')
Plot Showing Frequency Distribution of Attribute "PaperlessBilling":-
___________________________________________________________________________________
Pie Chart Showing Contribution of Each Category of "PaperlessBilling" feature:-
Key Observations:-
# Univariate analysis for PaymentMethod Attribute
cat_data('PaymentMethod')
Plot Showing Frequency Distribution of Attribute "PaymentMethod":-
___________________________________________________________________________________
Pie Chart Showing Contribution of Each Category of "PaymentMethod" feature:-
Key Observations:-
PaymentMethod Attribute:-
Electronic check is the most commonly used payment method among customers.
# Univariate analysis for MonthlyCharges Attribute
qt_data('MonthlyCharges')
Plot Showing Distribution of Feature "MonthlyCharges":-
__________________________________________________________________________________________________
Plot Showing 5 point summary with outliers of Attribute "MonthlyCharges":-
Key Observations:-
# Univariate analysis for TotalCharges Attribute
qt_data('TotalCharges')
Plot Showing Distribution of Feature "TotalCharges":-
__________________________________________________________________________________________________
Plot Showing 5 point summary with outliers of Attribute "TotalCharges":-
Key Observations:-
# Univariate analysis for Churn Attribute
cat_data('Churn')
Plot Showing Frequency Distribution of Attribute "Churn":-
___________________________________________________________________________________
Pie Chart Showing Contribution of Each Category of "Churn" feature:-
Key Observations:-
Bivariate analysis is performed to find the relationship between a quantitative variable and a categorical variable of the dataset.
For this analysis we use violin plots, because a violin plot depicts the distribution of numeric data for one or more groups using density curves; the width of each curve corresponds to the approximate frequency of data points in that region.
Since we have 17 categorical attributes and 3 quantitative attributes, we will use subplots for better representation.
# Creating Plot function for Quantitative VS Categorical Attributes
def bi_Anly(x):
    # Bivariate Analysis for Quantitative VS All Categorical Attributes
    print(f'\033[1m\nPlots Showing Bivariate Analysis of "{x}" VS All Categorical Attributes:-\n')
    # Setting up Sub-Plots
    fig, axes = plt.subplots(6, 3, figsize=(18, 20))
    fig.suptitle(f'"{x}" VS All Categorical Attributes')
    plt.subplots_adjust(left=0.1, bottom=0.1, right=0.9, top=0.94, wspace=0.3, hspace=0.6)
    # Plotting Sub-Plots
    sns.violinplot(ax=axes[0, 0], x='gender', y=x, data=teledata, palette='bright')
    sns.violinplot(ax=axes[0, 1], x='SeniorCitizen', y=x, data=teledata, palette='bright')
    sns.violinplot(ax=axes[0, 2], x='Partner', y=x, data=teledata, palette='bright')
    sns.violinplot(ax=axes[1, 0], x='Dependents', y=x, data=teledata, palette='bright')
    sns.violinplot(ax=axes[1, 1], x='PhoneService', y=x, data=teledata, palette='bright')
    sns.violinplot(ax=axes[1, 2], x='MultipleLines', y=x, data=teledata, palette='bright')
    sns.violinplot(ax=axes[2, 0], x='InternetService', y=x, data=teledata, palette='bright')
    sns.violinplot(ax=axes[2, 1], x='OnlineSecurity', y=x, data=teledata, palette='bright')
    sns.violinplot(ax=axes[2, 2], x='OnlineBackup', y=x, data=teledata, palette='bright')
    sns.violinplot(ax=axes[3, 0], x='DeviceProtection', y=x, data=teledata, palette='bright')
    sns.violinplot(ax=axes[3, 1], x='TechSupport', y=x, data=teledata, palette='bright')
    sns.violinplot(ax=axes[3, 2], x='StreamingTV', y=x, data=teledata, palette='bright')
    sns.violinplot(ax=axes[4, 0], x='StreamingMovies', y=x, data=teledata, palette='bright')
    sns.violinplot(ax=axes[4, 1], x='Contract', y=x, data=teledata, palette='bright')
    sns.violinplot(ax=axes[4, 2], x='PaperlessBilling', y=x, data=teledata, palette='bright')
    sns.violinplot(ax=axes[5, 0], x='PaymentMethod', y=x, data=teledata, palette='bright')
    sns.violinplot(ax=axes[5, 1], x='Churn', y=x, data=teledata, palette='bright')
    axes[5, 2].set_visible(False)  # 17 plots in an 18-slot grid; hide the unused axis
    plt.show()
Bivariate Analysis 1: "Tenure" VS All Categorical Attributes
# Bivariate Analysis for "Tenure" VS All Categorical Attributes
bi_Anly('tenure')
Plots Showing Bivariate Analysis of "tenure" VS All Categorical Attributes:-
Key Observations:-
Bivariate Analysis 2: "MonthlyCharges" VS All Categorical Attributes
# Bivariate Analysis for "MonthlyCharges" VS All Categorical Attributes
bi_Anly('MonthlyCharges')
Plots Showing Bivariate Analysis of "MonthlyCharges" VS All Categorical Attributes:-
Key Observations:-
Bivariate Analysis 3: "TotalCharges" VS All Categorical Attributes
# Bivariate Analysis for "TotalCharges" VS All Categorical Attributes
bi_Anly('TotalCharges')
Plots Showing Bivariate Analysis of "TotalCharges" VS All Categorical Attributes:-
Key Observations:-
# Multivariate Analysis to Check Relation Between Quantitative Attributes
print('\033[1m\nPlot Showing Multivariate Analysis to check Relation between Quantitative Attributes:-')
# Getting Quantitative Attributes and creating dataframe of it
dis_att = teledata[['tenure','MonthlyCharges','TotalCharges']].copy()  # .copy() avoids a SettingWithCopyWarning when we add a column later
# Plotting pairplot for Quantitative Attributes
sns.pairplot(dis_att, plot_kws={'color':'#9400D3'}, diag_kws={'color':'Gold'}, height=3.5).fig.suptitle(
    'Relation Between Quantitative Attributes', y=1.04)  # 'size' was renamed to 'height' in seaborn
plt.show()
Plot Showing Multivariate Analysis to check Relation between Quantitative Attributes:-
Key Observations:-
# Multivariate Analysis to check density of Target Attribute in Quantitative Attributes
print('\033[1mPlot Showing Multivariate Analysis to check Density of Target Attribute in Quantitative Attributes:-')
# Adding Target column in dis_att dataframe
dis_att['Churn'] = teledata['Churn']
# Plotting pairplot for Quantitative Attributes
sns.pairplot(dis_att, hue='Churn', palette='bright', height=3.5).fig.suptitle('Relation Between Quantitative Attributes', y=1.04)
plt.show()
Plot Showing Multivariate Analysis to check Density of Target Attribute in Quantitative Attributes:-
Key Observations:-
# Plotting Heatmap for checking Correlation
print('\033[1mHeatmap showing Correlation of Data attributes:-')
plt.figure(figsize=(22,18))
plt.title('Correlation of Data Attributes\n')
sns.heatmap(teledata.corr(),annot=True,fmt= '.2f',cmap='Spectral');
plt.show()
Heatmap showing Correlation of Data attributes:-
Key Observations:-
Here we will check for outliers in Quantitative data.
# Creating Required Columns
clm = ['tenure','MonthlyCharges','TotalCharges']
AT = []
OL = []
for i in clm:
    AT.append(i)
    # Getting Interquartile Range
    q1 = teledata[i].quantile(0.25)
    q3 = teledata[i].quantile(0.75)
    IQR = q3 - q1
    # Getting Outliers
    ol = []
    for k in teledata[i]:
        if (k < (q1 - 1.5 * IQR) or k > (q3 + 1.5 * IQR)):
            ol.append(k)
    OL.append(len(ol))
# Creating dataframe for better representation of Outlier Analysis
Outlier_Analysis = pd.DataFrame({'Attribute':AT,
                                 'Outliers':OL})
print('\n\033[1mTable Showing Outlier Analysis:-')
display(Outlier_Analysis)
Table Showing Outlier Analysis:-
| Attribute | Outliers | |
|---|---|---|
| 0 | tenure | 0 |
| 1 | MonthlyCharges | 0 |
| 2 | TotalCharges | 0 |
Key Observations:-
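The nested loop above can be written more compactly with vectorized pandas operations. A self-contained sketch of the same IQR rule on toy data (the helper name and toy column are illustrative):

```python
import pandas as pd

def iqr_outlier_counts(df, columns):
    """Count values outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR] per column."""
    counts = {}
    for col in columns:
        q1, q3 = df[col].quantile(0.25), df[col].quantile(0.75)
        iqr = q3 - q1
        # Boolean mask of outliers, summed instead of collected in a list
        mask = (df[col] < q1 - 1.5 * iqr) | (df[col] > q3 + 1.5 * iqr)
        counts[col] = int(mask.sum())
    return pd.DataFrame({'Attribute': list(counts), 'Outliers': list(counts.values())})

toy = pd.DataFrame({'tenure': [1, 2, 3, 4, 5, 100]})
print(iqr_outlier_counts(toy, ['tenure']))  # 100 falls above Q3 + 1.5*IQR
```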
# Creating list of Quantitative Feature column
clm = ['tenure','MonthlyCharges','TotalCharges']
# Transformation of Quantitative data
Scale = MinMaxScaler()
teledata[clm] = Scale.fit_transform(teledata[clm])
# Displaying minimum and maximum values of Quantitative attributes
display(pd.DataFrame({'Minimum':teledata[clm].min().values, 'Maximum':teledata[clm].max().values}, index = [clm]))
| Minimum | Maximum | |
|---|---|---|
| tenure | 0.0 | 1.0 |
| MonthlyCharges | 0.0 | 1.0 |
| TotalCharges | 0.0 | 1.0 |
Key Observations:-
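MinMaxScaler applies x' = (x - min) / (max - min) column-wise, mapping each attribute onto [0, 1]. A quick manual check of that formula on toy values:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

x = np.array([[2.0], [4.0], [10.0]])
scaled = MinMaxScaler().fit_transform(x)
manual = (x - x.min()) / (x.max() - x.min())  # same formula applied by hand
print(scaled.ravel())  # values: 0.0, 0.25, 1.0
assert np.allclose(scaled, manual)
```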
By separating the predictor and target attributes, we can perform further operations easily.
# Getting Predictors by dropping Class Attribute
X = teledata.drop(columns='Churn')
# Getting Target Attribute
y = teledata['Churn']
print('\033[1mTable Showing Segregated Predictors:-')
display(X.head())
Table Showing Segregated Predictors:-
| gender | SeniorCitizen | Partner | Dependents | tenure | PhoneService | MultipleLines | InternetService | OnlineSecurity | OnlineBackup | DeviceProtection | TechSupport | StreamingTV | StreamingMovies | Contract | PaperlessBilling | PaymentMethod | MonthlyCharges | TotalCharges | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 0 | 1 | 0 | 0.000000 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0.115423 | 0.001275 |
| 1 | 1 | 0 | 0 | 0 | 0.464789 | 1 | 0 | 1 | 1 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 1 | 0.385075 | 0.215867 |
| 2 | 1 | 0 | 0 | 0 | 0.014085 | 1 | 0 | 1 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 0.354229 | 0.010310 |
| 3 | 1 | 0 | 0 | 0 | 0.619718 | 0 | 0 | 1 | 1 | 0 | 1 | 1 | 0 | 0 | 1 | 0 | 2 | 0.239303 | 0.210241 |
| 4 | 0 | 0 | 0 | 0 | 0.014085 | 1 | 0 | 2 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0.521891 | 0.015330 |
Key Observations:-
# Checking Value Counts of Target Attribute
print('\033[1mTable Showing Total Observations in each section of Target attribute:-')
TAC = y.value_counts().to_frame('Total Observations')
display(TAC)
# Getting Percentages of each category in Target Attribute
print('\033[1m\n\nPie Chart Showing Percentage of Each Category of Target Attribute:-')
plt.title('Percentage of Each Category of Target Attribute\n\n\n\n\n\n')
explode = (0.05, 0.1)
y.value_counts().plot.pie(radius=2,explode=explode,shadow=True,autopct='%1.1f%%',colors=['yellowgreen','gold']);
plt.legend(loc='right',prop={'size': 12}, bbox_to_anchor=(2, 1))
plt.show()
Table Showing Total Observations in each section of Target attribute:-
| Total Observations | |
|---|---|
| 0 | 5163 |
| 1 | 1869 |
Pie Chart Showing Percentage of Each Category of Target Attribute:-
Key Observations:-
SMOTE first selects a minority class instance a at random and finds its k nearest minority class neighbors. The synthetic instance is then created by choosing one of the k nearest neighbors b at random and connecting a and b to form a line segment in the feature space. The synthetic instances are generated as a convex combination of the two chosen instances a and b.
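The interpolation step described above can be sketched in a few lines of NumPy. This is a hypothetical helper for illustration, not the imblearn implementation:

```python
import numpy as np

def smote_sample(minority, k=2, rng=None):
    """Generate one synthetic minority sample: pick a point a, one of its
    k nearest minority neighbours b, and return a + u*(b - a) with u in [0, 1)."""
    rng = np.random.default_rng(0) if rng is None else rng
    a_idx = rng.integers(len(minority))
    a = minority[a_idx]
    d = np.linalg.norm(minority - a, axis=1)  # distances to all minority points
    d[a_idx] = np.inf                         # exclude a itself
    neighbours = np.argsort(d)[:k]            # indices of k nearest neighbours
    b = minority[rng.choice(neighbours)]
    u = rng.random()
    return a + u * (b - a)                    # convex combination of a and b

minority = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]])
synthetic = smote_sample(minority)
print(synthetic)  # lies on a line segment between two minority points
```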
# Getting total observations of target attribute before transformation
yct = y.count()
# Transforming the dataset (SMOTE comes from the imbalanced-learn package, not sklearn)
from imblearn.over_sampling import SMOTE
OS = SMOTE(random_state=1)
X, y = OS.fit_resample(X, y)
# Checking Value Counts of Target Attribute after transforming
print('\033[1mTable Showing Total Observations in each section of target data for SMOTE:-')
TAC2 = y.value_counts().to_frame('Total Observations')
# For better representation
TVC = pd.DataFrame({'Before Transformation':TAC['Total Observations'],'After Transformation':TAC2['Total Observations']})
total = pd.Series({'Before Transformation':yct,'After Transformation':y.count()},name='Total')
TVC = pd.concat([TVC, total.to_frame().T])  # DataFrame.append was removed in pandas 2.0
columns=[('__________Total Observations__________', 'Before Transformation'), ('__________Total Observations__________',
                                                                               'After Transformation')]
TVC.columns = pd.MultiIndex.from_tuples(columns)
display(TVC)
# Getting Percentages of each category in Target Attribute
print('\033[1m\n\nPie Chart Showing Percentage of Each Category of Target Attribute:-')
plt.title('Percentage of Each Category of Target Attribute\n\n\n\n\n\n')
explode = (0.05, 0.1)
y.value_counts().plot.pie(radius=2,explode=explode,shadow=True,autopct='%1.1f%%',colors=['yellowgreen','gold']);
plt.legend(loc='right',prop={'size': 12}, bbox_to_anchor=(2, 1))
plt.show()
Table Showing Total Observations in each section of target data for SMOTE:-
| __________Total Observations__________ | ||
|---|---|---|
| Before Transformation | After Transformation | |
| 0 | 5163 | 5163 |
| 1 | 1869 | 5163 |
| Total | 7032 | 10326 |
Pie Chart Showing Percentage of Each Category of Target Attribute:-
Key Observations:-
# Splitting into Train and Test Sets
# Here test_size is not given because its default value is 0.25.
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0, stratify=y)
# For better observation of Splitted Data
TTS = pd.DataFrame({'Train':y_train.value_counts(),'Test':y_test.value_counts(),'Total Observations':y.value_counts()})
total = pd.Series({'Train':y_train.count(),'Test':y_test.count(),'Total Observations':y.shape[0]},name='Total')
TTS = pd.concat([TTS, total.to_frame().T])  # DataFrame.append was removed in pandas 2.0
print('\033[1mTable Showing Train-Test Split of Data:-')
display(TTS)
Table Showing Train-Test Split of Data:-
| Train | Test | Total Observations | |
|---|---|---|---|
| 1 | 3872 | 1291 | 5163 |
| 0 | 3872 | 1291 | 5163 |
| Total | 7744 | 2582 | 10326 |
Key Observations:-
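Passing `stratify=y` keeps the class ratio identical in the train and test splits, which is why both halves of the table above are exactly 50/50. A quick check on toy labels:

```python
import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(40).reshape(-1, 1)
y = np.array([0] * 20 + [1] * 20)  # balanced, like the SMOTE output
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0, stratify=y)
# Default test_size=0.25 -> 30 train / 10 test, both still 50/50
print(y_tr.mean(), y_te.mean())
```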
# Getting Statistical information
skw1 = X.skew().to_frame('Skewness')
skw2 = X_train.skew().to_frame('Skewness')
X1 = pd.concat([X.describe()[0:3].T,skw1],axis=1)
X2 = pd.concat([X_train.describe()[0:3].T,skw2],axis=1)
Xdata = pd.concat([X1, X2], axis=1)
y1 = y.describe()[0:3].T
y2 = y_train.describe()[0:3].T
ydata = pd.concat([y1, y2],axis=1)
skw3 = pd.DataFrame({}, index=['Skewness'], columns=['Churn','Churn'])
skw3.iloc[:,:1] = y.skew()
skw3.iloc[:,1:] = y_train.skew()
ydata = pd.concat([ydata, skw3])  # DataFrame.append was removed in pandas 2.0
# Displaying Statistical Characteristics Comparison of Train data with Original data
columns1=[('______________Original Data______________', 'count'),('______________Original Data______________', 'mean'),
('______________Original Data______________', 'std'),('______________Original Data______________', 'Skewness'),
('______________X_train Data______________', 'count'),('______________X_train Data______________', 'mean'),
('______________X_train Data______________', 'std'),('______________X_train Data______________', 'Skewness')]
Xdata.columns = pd.MultiIndex.from_tuples(columns1)
columns2=[('Original Data', 'Churn'),('y_train Data', 'Churn')]
ydata.columns = pd.MultiIndex.from_tuples(columns2)
print('\033[1m\nTable showing Statistical Characteristics for Predictors Attributes:-')
display(Xdata)
print('_____________________________________________________________________________________\n')
print('\033[1m\nTable showing Statistical Characteristics for Target Attributes:-')
display(ydata)
Table showing Statistical Characteristics for Predictors Attributes:-
| ______________Original Data______________ | ______________X_train Data______________ | |||||||
|---|---|---|---|---|---|---|---|---|
| count | mean | std | Skewness | count | mean | std | Skewness | |
| gender | 10326.0 | 0.484796 | 0.499793 | 0.060854 | 7744.0 | 0.486699 | 0.499855 | 0.053232 |
| SeniorCitizen | 10326.0 | 0.172284 | 0.377645 | 1.735916 | 7744.0 | 0.176395 | 0.381180 | 1.698352 |
| Partner | 10326.0 | 0.420492 | 0.493662 | 0.322178 | 7744.0 | 0.427040 | 0.494680 | 0.295053 |
| Dependents | 10326.0 | 0.241526 | 0.428029 | 1.207973 | 7744.0 | 0.243156 | 0.429016 | 1.197672 |
| tenure | 10326.0 | 0.370848 | 0.335089 | 0.541820 | 7744.0 | 0.372447 | 0.333646 | 0.529658 |
| PhoneService | 10326.0 | 0.904029 | 0.294566 | -2.743743 | 7744.0 | 0.902247 | 0.297000 | -2.709436 |
| MultipleLines | 10326.0 | 0.418071 | 0.493266 | 0.332255 | 7744.0 | 0.418518 | 0.493348 | 0.330410 |
| InternetService | 10326.0 | 1.345148 | 0.749876 | -0.659854 | 7744.0 | 1.346462 | 0.748449 | -0.661630 |
| OnlineSecurity | 10326.0 | 0.222545 | 0.415975 | 1.334256 | 7744.0 | 0.223915 | 0.416893 | 1.324830 |
| OnlineBackup | 10326.0 | 0.301763 | 0.459045 | 0.863862 | 7744.0 | 0.307464 | 0.461473 | 0.834657 |
| DeviceProtection | 10326.0 | 0.309123 | 0.462154 | 0.826193 | 7744.0 | 0.310434 | 0.462701 | 0.819601 |
| TechSupport | 10326.0 | 0.230777 | 0.421350 | 1.278155 | 7744.0 | 0.231405 | 0.421758 | 1.274022 |
| StreamingTV | 10326.0 | 0.386694 | 0.487016 | 0.465398 | 7744.0 | 0.384168 | 0.486429 | 0.476376 |
| StreamingMovies | 10326.0 | 0.390761 | 0.487945 | 0.447838 | 7744.0 | 0.395145 | 0.488913 | 0.429043 |
| Contract | 10326.0 | 0.498257 | 0.765065 | 1.129221 | 7744.0 | 0.495222 | 0.759615 | 1.135529 |
| PaperlessBilling | 10326.0 | 0.629963 | 0.482838 | -0.538435 | 7744.0 | 0.632361 | 0.482194 | -0.549135 |
| PaymentMethod | 10326.0 | 1.132675 | 1.145250 | 0.450770 | 7744.0 | 1.133135 | 1.143333 | 0.450670 |
| MonthlyCharges | 10326.0 | 0.490602 | 0.286713 | -0.375175 | 7744.0 | 0.491051 | 0.286535 | -0.377057 |
| TotalCharges | 10326.0 | 0.229490 | 0.249152 | 1.154400 | 7744.0 | 0.230961 | 0.249044 | 1.145369 |
_____________________________________________________________________________________
Table showing Statistical Characteristics for Target Attributes:-
| Original Data | y_train Data | |
|---|---|---|
| Churn | Churn | |
| count | 10326 | 7744 |
| mean | 0.5 | 0.5 |
| std | 0.500024 | 0.500032 |
| Skewness | 0 | 0 |
Key Observations:-
# Getting Statistical information
skw1 = X.skew().to_frame('Skewness')
skw2 = X_test.skew().to_frame('Skewness')
X1 = pd.concat([X.describe()[0:3].T,skw1],axis=1)
X2 = pd.concat([X_test.describe()[0:3].T,skw2],axis=1)
Xdata2 = pd.concat([X1, X2], axis=1)
y1 = y.describe()[0:3].T
y2 = y_test.describe()[0:3].T
ydata = pd.concat([y1, y2],axis=1)
skw4 = pd.DataFrame({}, index=['Skewness'], columns=['Churn','Churn'])
skw4.iloc[:,:1] = y.skew()
skw4.iloc[:,1:] = y_test.skew()
ydata2 = pd.concat([ydata, skw4])  # use skw4 (test-set skewness); DataFrame.append was removed in pandas 2.0
# Displaying Statistical Characteristics Comparison of Test data with Original data
columns1=[('______________Original Data______________', 'count'),('______________Original Data______________', 'mean'),
('______________Original Data______________', 'std'),('______________Original Data______________', 'Skewness'),
('______________X_test Data______________', 'count'),('______________X_test Data______________', 'mean'),
('______________X_test Data______________', 'std'),('______________X_test Data______________', 'Skewness')]
Xdata2.columns = pd.MultiIndex.from_tuples(columns1)
columns2=[('Original Data', 'Churn'),('y_test Data', 'Churn')]
ydata2.columns = pd.MultiIndex.from_tuples(columns2)
print('\033[1m\nTable showing Statistical Characteristics for Predictors Attributes:-')
display(Xdata2)
print('_____________________________________________________________________________________\n')
print('\033[1m\nTable showing Statistical Characteristics for Target Attributes:-')
display(ydata2)
Table showing Statistical Characteristics for Predictors Attributes:-
| ______________Original Data______________ | ______________X_test Data______________ | |||||||
|---|---|---|---|---|---|---|---|---|
| count | mean | std | Skewness | count | mean | std | Skewness | |
| gender | 10326.0 | 0.484796 | 0.499793 | 0.060854 | 2582.0 | 0.479086 | 0.499659 | 0.083778 |
| SeniorCitizen | 10326.0 | 0.172284 | 0.377645 | 1.735916 | 2582.0 | 0.159954 | 0.366634 | 1.856402 |
| Partner | 10326.0 | 0.420492 | 0.493662 | 0.322178 | 2582.0 | 0.400852 | 0.490166 | 0.404862 |
| Dependents | 10326.0 | 0.241526 | 0.428029 | 1.207973 | 2582.0 | 0.236638 | 0.425101 | 1.240016 |
| tenure | 10326.0 | 0.370848 | 0.335089 | 0.541820 | 2582.0 | 0.366053 | 0.339403 | 0.578268 |
| PhoneService | 10326.0 | 0.904029 | 0.294566 | -2.743743 | 2582.0 | 0.909373 | 0.287134 | -2.853648 |
| MultipleLines | 10326.0 | 0.418071 | 0.493266 | 0.332255 | 2582.0 | 0.416731 | 0.493113 | 0.337989 |
| InternetService | 10326.0 | 1.345148 | 0.749876 | -0.659854 | 2582.0 | 1.341208 | 0.754271 | -0.654783 |
| OnlineSecurity | 10326.0 | 0.222545 | 0.415975 | 1.334256 | 2582.0 | 0.218435 | 0.413264 | 1.363693 |
| OnlineBackup | 10326.0 | 0.301763 | 0.459045 | 0.863862 | 2582.0 | 0.284663 | 0.451341 | 0.954949 |
| DeviceProtection | 10326.0 | 0.309123 | 0.462154 | 0.826193 | 2582.0 | 0.305190 | 0.460577 | 0.846596 |
| TechSupport | 10326.0 | 0.230777 | 0.421350 | 1.278155 | 2582.0 | 0.228892 | 0.420201 | 1.291371 |
| StreamingTV | 10326.0 | 0.386694 | 0.487016 | 0.465398 | 2582.0 | 0.394268 | 0.488788 | 0.432965 |
| StreamingMovies | 10326.0 | 0.390761 | 0.487945 | 0.447838 | 2582.0 | 0.377614 | 0.484884 | 0.505195 |
| Contract | 10326.0 | 0.498257 | 0.765065 | 1.129221 | 2582.0 | 0.507359 | 0.781261 | 1.109901 |
| PaperlessBilling | 10326.0 | 0.629963 | 0.482838 | -0.538435 | 2582.0 | 0.622773 | 0.484786 | -0.506896 |
| PaymentMethod | 10326.0 | 1.132675 | 1.145250 | 0.450770 | 2582.0 | 1.131294 | 1.151199 | 0.451351 |
| MonthlyCharges | 10326.0 | 0.490602 | 0.286713 | -0.375175 | 2582.0 | 0.489254 | 0.287298 | -0.369730 |
| TotalCharges | 10326.0 | 0.229490 | 0.249152 | 1.154400 | 2582.0 | 0.225080 | 0.249475 | 1.182930 |
_____________________________________________________________________________________
Table showing Statistical Characteristics for Target Attributes:-
| Original Data | y_test Data | |
|---|---|---|
| Churn | Churn | |
| count | 10326 | 2582 |
| mean | 0.5 | 0.5 |
| std | 0.500024 | 0.500097 |
| Skewness | 0 | 0 |
Key Observations:-
# Building Decision Tree Classifier
from sklearn.tree import DecisionTreeClassifier, plot_tree  # not in the common imports cell above
DT = DecisionTreeClassifier(criterion='gini', max_depth=6, random_state=5)
DT.fit(X_train, y_train)
y_predict = DT.predict(X_test)
# Getting Accuracies for train and test data
Train_AC = DT.score(X_train, y_train)
Test_AC = DT.score(X_test, y_test)
# Displaying Decision Tree Classifier model accuracies for train and test Data
print('\033[1mTable Showing Decision Tree Classifier Model Accuracies for Train and Test Data:-')
display(pd.DataFrame({'Data':['Training','Testing'],'Decision Tree Classifier Accuracy (%)':
[Train_AC,Test_AC]}).set_index('Data'))
print('\n______________________________________________________________________________________\n')
# Building Confusion Matrix for Decision Tree Classifier
CM = metrics.confusion_matrix(y_test, y_predict)
Con_Mat = pd.DataFrame(CM)
# Displaying Confusion Matrix for Decision Tree Classifier
print('\033[1m\nHeatmap Showing Performance of Decision Tree Classifier:-')
plt.figure(figsize = (7,5))
sns.heatmap(Con_Mat, annot=True, fmt=".1f")
plt.title('Confusion Matrix of Decision Tree Classifier\n')
plt.xlabel('\nPredicted Labels\n')
plt.ylabel('Actual Labels\n')
plt.show()
print('\n______________________________________________________________________________________\n')
# Visualizing the Decision Tree
print('\033[1m\nVisualizing the Decision Tree:-')
fig, ax = plt.subplots(figsize = (3.7,2.3), dpi=300)  # keep the figure handle so it can be saved below
plot_tree(DT, filled=True)
plt.title('Decision Tree', fontdict={'fontsize':8})
plt.show() # If plt.show() doesn't work, use "fig.savefig('DecisionTree.png')" to save then load.
Table Showing Decision Tree Classifier Model Accuracies for Train and Test Data:-
| Decision Tree Classifier Accuracy (%) | |
|---|---|
| Data | |
| Training | 0.795455 |
| Testing | 0.786212 |
______________________________________________________________________________________
Heatmap Showing Performance of Decision Tree Classifier:-
______________________________________________________________________________________
Visualizing the Decision Tree:-
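At this figure size the rendered tree can be hard to read. sklearn's `export_text` prints the fitted splits as plain text instead; below is a sketch on synthetic stand-in data (in the notebook, the call would be `export_text(DT, feature_names=list(X_train.columns))`, assuming `X_train` is a DataFrame):

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier, export_text

# Small synthetic stand-in for the churn features.
X, y = make_classification(n_samples=200, n_features=4, random_state=5)
tree = DecisionTreeClassifier(criterion='gini', max_depth=2, random_state=5).fit(X, y)

# Plain-text view of the fitted splits, one line per node.
print(export_text(tree, feature_names=['f0', 'f1', 'f2', 'f3']))
```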
Key Observations:-
# Building Bagging Classifier
from sklearn.ensemble import BaggingClassifier
# Note: in scikit-learn >= 1.2, "base_estimator" is renamed to "estimator"
BC = BaggingClassifier(base_estimator=DT, n_estimators=14, random_state=1)
BC.fit(X_train, y_train)
y_predict = BC.predict(X_test)
# Getting Accuracies for train and test data
BCAC1 = BC.score(X_train , y_train)
BCAC2 = BC.score(X_test , y_test)
# Building Confusion Matrix for Bagging Classifier
CM = metrics.confusion_matrix(y_test, y_predict)
Con_Mat = pd.DataFrame(CM)
# Displaying Confusion Matrix for Bagging Classifier
print('\033[1m\nHeatmap Showing Performance of Bagging Classifier:-')
plt.figure(figsize = (7,5))
sns.heatmap(Con_Mat, annot=True, fmt=".1f")
plt.title('Confusion Matrix of Bagging Classifier\n')
plt.xlabel('\nPredicted Labels\n')
plt.ylabel('Actual Labels\n')
plt.show()
Heatmap Showing Performance of Bagging Classifier:-
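Bagging also offers a built-in validation estimate: with `oob_score=True`, each tree is scored on the rows left out of its bootstrap sample (roughly a third of the data), so no extra hold-out set is needed. A sketch on synthetic stand-in data (with more estimators than above, so every row gets an out-of-bag prediction):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, n_features=10, random_state=1)

# Each tree is evaluated on the samples absent from its bootstrap draw;
# oob_score_ aggregates those held-out predictions into one accuracy.
bag = BaggingClassifier(DecisionTreeClassifier(max_depth=6, random_state=5),
                        n_estimators=50, oob_score=True, random_state=1)
bag.fit(X, y)
print(f'OOB accuracy: {bag.oob_score_:.3f}')
```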
Key Observations:-
# Building AdaBoost Classifier
from sklearn.ensemble import AdaBoostClassifier
AB = AdaBoostClassifier(n_estimators=100, random_state=1)
AB.fit(X_train, y_train)
y_predict = AB.predict(X_test)
# Getting Accuracies for train and test data
ABAC1 = AB.score(X_train , y_train)
ABAC2 = AB.score(X_test , y_test)
# Building Confusion Matrix for AdaBoost Classifier
CM = metrics.confusion_matrix(y_test, y_predict)
Con_Mat = pd.DataFrame(CM)
# Displaying Confusion Matrix for AdaBoost Classifier
print('\033[1m\nHeatmap Showing Performance of AdaBoost Classifier:-')
plt.figure(figsize = (7,5))
sns.heatmap(Con_Mat, annot=True, fmt=".1f")
plt.title('Confusion Matrix of AdaBoost Classifier\n')
plt.xlabel('\nPredicted Labels\n')
plt.ylabel('Actual Labels\n')
plt.show()
Heatmap Showing Performance of AdaBoost Classifier:-
Key Observations:-
# Building Gradient Boosting Classifier
from sklearn.ensemble import GradientBoostingClassifier
GB = GradientBoostingClassifier(n_estimators=55, random_state=1)
GB.fit(X_train, y_train)
y_predict = GB.predict(X_test)
# Getting Accuracies for train and test data
GBAC1 = GB.score(X_train , y_train)
GBAC2 = GB.score(X_test , y_test)
# Building Confusion Matrix for Gradient Boosting Classifier
CM = metrics.confusion_matrix(y_test, y_predict)
Con_Mat = pd.DataFrame(CM)
# Displaying Confusion Matrix for Gradient Boosting Classifier
print('\033[1m\nHeatmap Showing Performance of Gradient Boosting Classifier:-')
plt.figure(figsize = (7,5))
sns.heatmap(Con_Mat, annot=True, fmt=".1f")
plt.title('Confusion Matrix of Gradient Boosting Classifier\n')
plt.xlabel('\nPredicted Labels\n')
plt.ylabel('Actual Labels\n')
plt.show()
Heatmap Showing Performance of Gradient Boosting Classifier:-
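A value like `n_estimators=55` can be chosen without refitting once per candidate: `staged_predict` evaluates the ensemble after each added tree within a single fit. A sketch on synthetic stand-in data:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

X, y = make_classification(n_samples=600, random_state=1)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=1)

gb = GradientBoostingClassifier(n_estimators=100, random_state=1).fit(X_tr, y_tr)

# staged_predict yields test predictions after 1, 2, ... trees, so one fit
# shows where extra trees stop helping on held-out data.
test_acc = [accuracy_score(y_te, p) for p in gb.staged_predict(X_te)]
best_n = test_acc.index(max(test_acc)) + 1
print(f'best n_estimators: {best_n} (accuracy {max(test_acc):.3f})')
```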
Key Observations:-
# Building RandomForest Classifier
from sklearn.ensemble import RandomForestClassifier
RF = RandomForestClassifier(n_estimators=50, random_state=1, max_depth=7, max_features=12)
RF.fit(X_train, y_train)
y_predict = RF.predict(X_test)
# Getting Accuracies for train and test data
RFAC1 = RF.score(X_train , y_train)
RFAC2 = RF.score(X_test , y_test)
# Building Confusion Matrix for RandomForest Classifier
CM = metrics.confusion_matrix(y_test, y_predict)
Con_Mat = pd.DataFrame(CM)
# Displaying Confusion Matrix for RandomForest Classifier
print('\033[1m\nHeatmap Showing Performance of RandomForest Classifier:-')
plt.figure(figsize = (7,5))
sns.heatmap(Con_Mat, annot=True, fmt=".1f")
plt.title('Confusion Matrix of RandomForest Classifier\n')
plt.xlabel('\nPredicted Labels\n')
plt.ylabel('Actual Labels\n')
plt.show()
Heatmap Showing Performance of RandomForest Classifier:-
Key Observations:-
The idea behind the VotingClassifier is to combine conceptually different machine learning classifiers and use either a majority vote (hard voting) or the average of the predicted probabilities (soft voting) to predict the class labels.
Here we build a Voting Classifier from three models: the Decision Tree above, Logistic Regression, and SVC.
All three models were pre-checked to find the hyperparameters giving their best accuracies; we then use them to build our Voting Classifier.
# Creating an empty list and a list of pre-tuned models
models1 = []
models2 = [DT, LogisticRegression(random_state=1,C=100), SVC(gamma='auto',random_state=1,C=10)]
# Appending (name, model) pairs, as VotingClassifier expects
for i in models2:
    models1.append((f'{i}', i))
# Building Voting Classifier model
from sklearn.ensemble import VotingClassifier
VC = VotingClassifier(models1)
VC.fit(X_train,y_train)
y_predict = VC.predict(X_test)
# Getting Accuracies for train and test data
VCAC1 = VC.score(X_train , y_train)
VCAC2 = VC.score(X_test , y_test)
# Building Confusion Matrix for Voting Classifier
CM = metrics.confusion_matrix(y_test, y_predict)
Con_Mat = pd.DataFrame(CM)
# Displaying Confusion Matrix for Voting Classifier
print('\033[1m\nHeatmap Showing Performance of Voting Classifier:-')
plt.figure(figsize = (7,5))
sns.heatmap(Con_Mat, annot=True, fmt=".1f")
plt.title('Confusion Matrix of Voting Classifier\n')
plt.xlabel('\nPredicted Labels\n')
plt.ylabel('Actual Labels\n')
plt.show()
Heatmap Showing Performance of Voting Classifier:-
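The classifier above uses the default hard (majority) voting. Below is a sketch of the soft-voting variant mentioned in the description, on synthetic stand-in data with the same three estimator types; note SVC needs `probability=True` so its `predict_proba` output can be averaged:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=400, random_state=1)

# voting='soft' averages class probabilities across models instead of
# counting hard votes, which often gives smoother decision boundaries.
soft = VotingClassifier(
    [('dt', DecisionTreeClassifier(max_depth=6, random_state=5)),
     ('lr', LogisticRegression(random_state=1, C=100)),
     ('svc', SVC(gamma='auto', random_state=1, C=10, probability=True))],
    voting='soft')
soft.fit(X, y)
print(f'soft-voting train accuracy: {soft.score(X, y):.3f}')
```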
Key Observations:-
# Getting Models and Accuracies in lists
M = ['Bagging Classifier','AdaBoost Classifier','Gradient Boosting Classifier','RandomForest Classifier','Voting Classifier']
Train = list(map(lambda x: round(x*100,2) , [BCAC1, ABAC1, GBAC1, RFAC1, VCAC1]))
Test = list(map(lambda x: round(x*100,2) , [BCAC2, ABAC2, GBAC2, RFAC2, VCAC2]))
# Displaying Classification Accuracies of Ensemble Models for Train and Test Data.
print('\033[1mTable Showing Ensemble Models Classification Accuracies for Train and Test Data:-')
all_models1 = pd.DataFrame({'Ensemble Models':M,'Train Accuracy (%)':Train,'Test Accuracy (%)':Test}
).set_index('Ensemble Models')
display(all_models1)
Table Showing Ensemble Models Classification Accuracies for Train and Test Data:-
| Train Accuracy (%) | Test Accuracy (%) | |
|---|---|---|
| Ensemble Models | ||
| Bagging Classifier | 81.04 | 79.74 |
| AdaBoost Classifier | 79.49 | 79.32 |
| Gradient Boosting Classifier | 80.18 | 80.13 |
| RandomForest Classifier | 82.93 | 81.53 |
| Voting Classifier | 81.79 | 81.25 |
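A single train/test split can flatter or punish a model by chance. The `cross_val_score` and `KFold` utilities imported at the top can average accuracy over several splits before trusting a ranking like the one above; a sketch on synthetic stand-in data:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import KFold, cross_val_score

X, y = make_classification(n_samples=500, n_features=12, random_state=1)

rf = RandomForestClassifier(n_estimators=50, random_state=1, max_depth=7)
# Shuffled 5-fold split: each row is used for validation exactly once.
cv = KFold(n_splits=5, shuffle=True, random_state=1)
scores = cross_val_score(rf, X, y, cv=cv)
print(f'5-fold accuracy: {scores.mean():.3f} +/- {scores.std():.3f}')
```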
Key Observations:-
List of different Models Applied to our Data:-
All the models below were pre-checked to find the hyperparameters giving their best accuracies.
# Creating list of pre-checked models with best parameters
models = [DT, DecisionTreeClassifier(criterion='entropy', max_depth=6, random_state=5), RF,
RandomForestClassifier(criterion='entropy',n_estimators=50, random_state=1, max_depth=7, max_features=12),
LogisticRegression(random_state=1,C=100), GaussianNB(), SVC(gamma='auto',random_state=1,C=10),
SVC(kernel='linear',gamma='auto',random_state=1,C=50), KNeighborsClassifier(metric='euclidean',n_neighbors=20)]
Train_Accuracy = []
Test_Accuracy = []
# Training models and getting accuracies for Train and Test Data
for i in models:
    i.fit(X_train,y_train)
    Train_Accuracy.append(round(i.score(X_train,y_train)*100,2))
    Test_Accuracy.append(round(i.score(X_test,y_test)*100,2))
Key Observations:-
# Displaying Accuracies of Train and Test Data for trained models
print('\033[1mTable Showing Accuracies of Train and Test Data from Various Algorithms:-')
model = ['Decision Tree Classifier(gini)','Decision Tree Classifier(entropy)','Random Forest Classifier(gini)',
'Random Forest Classifier(entropy)','Logistic Regression','Gaussian Naive Bayes','Support Vector Classifier(rbf)',
'Support Vector Classifier(linear)','K-Neighbors Classifier']
all_models2 = pd.DataFrame({'Trained Model':model,'Train Accuracy (%)':Train_Accuracy,'Test Accuracy (%)':Test_Accuracy}
).set_index('Trained Model')
display(all_models2)
Table Showing Accuracies of Train and Test Data from Various Algorithms:-
| Train Accuracy (%) | Test Accuracy (%) | |
|---|---|---|
| Trained Model | ||
| Decision Tree Classifier(gini) | 79.55 | 78.62 |
| Decision Tree Classifier(entropy) | 79.39 | 78.70 |
| Random Forest Classifier(gini) | 82.93 | 81.53 |
| Random Forest Classifier(entropy) | 82.24 | 80.79 |
| Logistic Regression | 80.06 | 80.64 |
| Gaussian Naive Bayes | 76.60 | 77.65 |
| Support Vector Classifier(rbf) | 81.96 | 81.18 |
| Support Vector Classifier(linear) | 80.09 | 80.52 |
| K-Neighbors Classifier | 79.82 | 79.09 |
# Comparing Accuracies of Train and Test data for All the Trained Models so far
print('\033[1mComparing Accuracies of Train and Test Data for All the Trained Models so far:-')
# Dropping the duplicate row: 'RandomForest Classifier' is the same model as 'Random Forest Classifier(gini)'
all_models1.drop('RandomForest Classifier',inplace=True)
display(pd.concat([all_models1, all_models2]))  # DataFrame.append was removed in pandas 2.0
Comparing Accuracies of Train and Test Data for All the Trained Models so far:-
| Train Accuracy (%) | Test Accuracy (%) | |
|---|---|---|
| Bagging Classifier | 81.04 | 79.74 |
| AdaBoost Classifier | 79.49 | 79.32 |
| Gradient Boosting Classifier | 80.18 | 80.13 |
| Voting Classifier | 81.79 | 81.25 |
| Decision Tree Classifier(gini) | 79.55 | 78.62 |
| Decision Tree Classifier(entropy) | 79.39 | 78.70 |
| Random Forest Classifier(gini) | 82.93 | 81.53 |
| Random Forest Classifier(entropy) | 82.24 | 80.79 |
| Logistic Regression | 80.06 | 80.64 |
| Gaussian Naive Bayes | 76.60 | 77.65 |
| Support Vector Classifier(rbf) | 81.96 | 81.18 |
| Support Vector Classifier(linear) | 80.09 | 80.52 |
| K-Neighbors Classifier | 79.82 | 79.09 |
Key Observations:-
Based on the above results, the RandomForest Classifier (gini) has the highest accuracies for both train and test data.
# Getting Classification Report for RandomForest Classifier(gini) Model
RF_Predict = RF.predict(X_test)
RFCR = metrics.classification_report(y_test, RF_Predict, output_dict=True)
# Displaying RandomForest Classifier(gini) Model Classification Report
print('\033[1m\nTable Showing RandomForest Classifier(gini) Model Classification Report:-')
display(pd.DataFrame(RFCR))
Table Showing RandomForest Classifier(gini) Model Classification Report:-
| 0 | 1 | accuracy | macro avg | weighted avg | |
|---|---|---|---|---|---|
| precision | 0.852076 | 0.785414 | 0.815259 | 0.818745 | 0.818745 |
| recall | 0.762974 | 0.867545 | 0.815259 | 0.815259 | 0.815259 |
| f1-score | 0.805067 | 0.824439 | 0.815259 | 0.814753 | 0.814753 |
| support | 1291.000000 | 1291.000000 | 0.815259 | 2582.000000 | 2582.000000 |
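The report's entries can be verified by hand. The confusion-matrix counts below are reconstructed arithmetically from the class-1 precision, recall and supports in the table above (TP = recall x support, and so on); plugging them into the standard formulas reproduces the report:

```python
import numpy as np

# 2x2 confusion matrix in sklearn's layout [[TN, FP], [FN, TP]],
# reconstructed from the classification report above.
cm = np.array([[985, 306],
               [171, 1120]])
TN, FP = cm[0]
FN, TP = cm[1]

precision = TP / (TP + FP)   # of predicted churners, how many actually churned
recall = TP / (TP + FN)      # of actual churners, how many were caught
f1 = 2 * precision * recall / (precision + recall)
accuracy = (TP + TN) / cm.sum()
print(f'precision={precision:.6f} recall={recall:.6f} '
      f'f1={f1:.6f} accuracy={accuracy:.6f}')
```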
# Get Feature Importance from the RF classifier
fi = RF.feature_importances_
# Displaying Feature Importance
FI = pd.Series(fi, index=teledata.columns[1:]).sort_values()
print('\033[1mPlot showing Feature Importance of RandomForest Classifier(gini):-')
FI.plot(kind='barh', figsize=(10,10), color= '#9400D3')
plt.title('Feature Importance of RandomForest Classifier\n')
plt.show()
Plot showing Feature Importance of RandomForest Classifier(gini):-
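Impurity-based importances like those plotted above can be biased toward features with many possible split points. `permutation_importance`, which shuffles one column at a time and measures the resulting score drop, is a common cross-check; a sketch on synthetic stand-in data:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance

X, y = make_classification(n_samples=400, n_features=8, random_state=1)
rf = RandomForestClassifier(n_estimators=50, random_state=1).fit(X, y)

# Shuffle each feature 5 times; the mean score drop is its importance.
result = permutation_importance(rf, X, y, n_repeats=5, random_state=1)
for idx in result.importances_mean.argsort()[::-1][:3]:
    print(f'feature {idx}: {result.importances_mean[idx]:.3f}')
```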
Comments:-
Based on the above results, we select the RandomForest Classifier (gini) as our final best trained model.
Pickle is the standard way of serializing objects in Python.
Here we create a .sav file and save our model in it.
import pickle
# Creating file name
Pickle_file = 'PickleFile.sav'
# Saving the Final best trained Random Forest model to the file
pickle.dump(RF, open(Pickle_file, 'wb')) # Here we use "dump"
# Loading the Pickle File to get Random Forest Model
RF_model1 = pickle.load(open(Pickle_file, 'rb')) # Here we use "load"
# Checking the loaded model by finding accuracies
Train_AC = round(RF_model1.score(X_train , y_train),4)*100
Test_AC = round(RF_model1.score(X_test , y_test),4)*100
print('\033[1mTable Showing Random Forest Classifier Model Accuracies for Train and Test Data:-')
display(pd.DataFrame({'Data':['Training','Testing'],'Random Forest Classifier Accuracy (%)':
[Train_AC,Test_AC]}).set_index('Data'))
Table Showing Random Forest Classifier Model Accuracies for Train and Test Data:-
| Random Forest Classifier Accuracy (%) | |
|---|---|
| Data | |
| Training | 82.93 |
| Testing | 81.53 |
In case the above method fails due to an error in loading the file, you can use this in-memory alternative.
# Saving the Final best trained Random Forest model to a Variable
Pickle_model = pickle.dumps(RF) # Here we use "dumps" NOT "dump"
# Loading the Pickle Model
RF_model2 = pickle.loads(Pickle_model) # Here we use "loads" NOT "load"
# Checking the loaded model by finding accuracies
Train_AC = round(RF_model2.score(X_train , y_train),4)*100
Test_AC = round(RF_model2.score(X_test , y_test),4)*100
print('\033[1mTable Showing Random Forest Classifier Model Accuracies for Train and Test Data:-')
display(pd.DataFrame({'Data':['Training','Testing'],'Random Forest Classifier Accuracy (%)':
[Train_AC,Test_AC]}).set_index('Data'))
Table Showing Random Forest Classifier Model Accuracies for Train and Test Data:-
| Random Forest Classifier Accuracy (%) | |
|---|---|
| Data | |
| Training | 82.93 |
| Testing | 81.53 |
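As an alternative to pickle, the scikit-learn documentation recommends joblib for persisting fitted estimators, since it handles the large NumPy arrays inside them more efficiently. A minimal sketch (writing to a temporary directory; in practice you would pick a fixed path):

```python
import os
import tempfile

import joblib
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=300, random_state=1)
rf = RandomForestClassifier(n_estimators=50, random_state=1).fit(X, y)

# Persist and reload the fitted model with joblib.
path = os.path.join(tempfile.mkdtemp(), 'rf_model.joblib')
joblib.dump(rf, path)
loaded = joblib.load(path)
print(f'reloaded model accuracy: {loaded.score(X, y):.3f}')
```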
Closing Sentence:- The predictions made by our models will help the company understand the pain points and patterns of customer churn and will sharpen its focus on strategising customer retention.
------------------------------------------------------------------------------THANK YOU😊----------------------------------------------------------------------------------